noburn.dev
← BlogJoin waitlist
llm observabilitymarket landscapemonitoring2026

The State of LLM Observability in 2026: What Changed and What Still Doesn't Work

Two years into the LLM observability market and the tools have matured. The enforcement gap has not closed. Here is a clear-eyed view of what the category can and cannot do today.

nb
noburn.dev·2026-06-03

LLM observability tools capture what happens when your application calls a model: the prompt, the completion, the token counts, the latency, the cost, and increasingly the eval scores that tell you whether the output was any good. The category exists because raw provider dashboards stop at billing totals. They tell you that you spent $4,200 last month. They do not tell you which prompt, which user, or which retry loop spent it.

Two years in, the category has fragmented because "observability" turned out to be three different jobs. Tracing and debugging multi-step agent runs is one. Cost and usage analytics is another. Evaluation and regression testing of model output is a third. Most tools started in one corner and grew toward the others, but the architectural choice they made early (proxy gateway versus async SDK versus OpenTelemetry collector) still determines what they can and cannot do. The result is a market where no single tool covers the full surface, and where every tool shares one structural limit: they all observe the call after it has already fired.

LangSmith

LangSmith is the observability and evaluation platform from the LangChain team. It records traces of chains, agents, and tool calls with first-class support for the LangChain and LangGraph execution model, which makes it the default choice if your application is already built on those frameworks. The trace view reconstructs nested agent steps, retriever calls, and tool invocations in a tree you can actually read, which is the single thing it does better than almost anything else.

It has moved well beyond LangChain-only instrumentation. The SDK accepts traces from arbitrary code via decorators and an OpenTelemetry endpoint, and the evaluation suite (datasets, LLM-as-judge, pairwise experiments) is now the center of gravity for teams doing prompt regression testing. Pricing is per-seat plus a usage component on traces; the free Developer tier includes a monthly trace allowance and paid plans start in the low tens of dollars per seat per month. Check current pricing before you commit, because the trace-retention and ingestion limits have changed more than once.

What LangSmith does not do is stop a call. It is an after-the-fact record. If a runaway agent loops fifty times, LangSmith will show you all fifty steps in a beautiful trace, billed to you in full.

Helicone

Helicone is the proxy-first option. You change your base URL to route requests through Helicone, and it logs every call with near-zero code change. Because it sits in the request path as a gateway, it can do things async SDKs cannot: response caching, rate limiting, and request retries at the proxy layer. It is open source and self-hostable, which matters for teams that cannot send prompt contents to a third party.

The proxy architecture is also its main tradeoff. Routing production traffic through an external hop adds a network dependency and a latency consideration, and teams running at high volume tend to either self-host the proxy or use the async logging mode, which gives up the inline gateway features. Pricing has a free tier with a monthly log allowance and usage-based paid tiers above it; check current pricing for the request thresholds.

Helicone's rate limiting is the closest thing in the mainstream toolset to enforcement, and it is worth being precise about why it is not the same thing. Rate limiting caps request frequency. It does not estimate the dollar cost of a specific call against a remaining budget and block on that basis. A user can stay under the request-rate limit and still blow through a spend cap with a handful of long-context calls.

Langfuse

Langfuse is the open-source observability project that has become the default self-hosted choice. It does tracing, prompt management, evals, and cost analytics, and the entire stack runs on your own infrastructure with a permissive license. For teams whose compliance posture rules out sending prompts to a vendor, Langfuse and Helicone are usually the two names on the shortlist.

The managed cloud version exists with a free tier and paid plans, but the reason people choose Langfuse is the self-host path. The SDK is async and framework-agnostic, with integrations for the major providers and orchestration libraries. Prompt versioning and the ability to fetch prompts at runtime from the Langfuse server are genuinely useful and underrated features.

As an async SDK, Langfuse observes and does not intervene. It records cost after the fact with good granularity, including per-trace and per-user attribution if you pass the metadata, but that attribution is for reporting. Nothing in the pipeline blocks a call because a user is over budget.

Portkey

Portkey is an AI gateway with observability built on top. Like Helicone it sits in the request path, but its center of gravity is routing: load balancing across providers, automatic fallbacks, retries, and a virtual-key system that lets you manage provider credentials and limits centrally. The observability layer (logs, cost analytics, traces) comes along with the gateway.

The virtual-key and budget-limit features are the most enforcement-adjacent in the mainstream market. You can attach spend limits to a virtual key and have the gateway reject requests once the limit is hit. This is real enforcement at the key level, and for some architectures it is enough. The constraints are that it is tied to routing your traffic through the Portkey gateway, the budget granularity is the virtual key rather than arbitrary per-end-user identity inside a multi-tenant app, and the limit check is a hard cap rather than a pre-flight estimate of the specific call you are about to make. Pricing runs from a free tier to usage-based paid plans; check current pricing.

Arize Phoenix

Arize Phoenix is the open-source observability and evaluation tool from Arize, built natively on OpenTelemetry. Its strength is evaluation and troubleshooting at the trace level, with strong support for retrieval-augmented generation debugging, embedding visualization, and eval workflows. It runs locally or self-hosted, and it plugs into the broader Arize platform if you want managed monitoring at scale.

Because it is OTel-native, Phoenix fits cleanly into teams that already run OpenTelemetry collectors and want LLM spans alongside the rest of their distributed traces. That is also its framing: it is a debugging and eval surface, not a cost-control surface. Cost shows up as an attribute on spans you can analyze, not as a lever you can pull before a call. There is no enforcement mechanism, by design.

Datadog LLM Observability

Datadog added LLM observability as a module inside its existing platform, which is the entire pitch: if your infrastructure, APM, and logs already live in Datadog, your LLM traces can live there too, correlated with everything else. You get spans, token and cost tracking, quality and safety evaluations, and the same alerting and dashboard machinery you already use for the rest of your stack.

The integration story is the reason to choose it and the reason to be cautious. Consolidation is real value. So is the cost: Datadog's LLM Observability is priced as an add-on, typically per-ingested-span or per-session on top of your existing Datadog spend, and it can become a meaningful line item at volume. Check current pricing against your projected trace volume specifically. And like every tool in this list, it is a monitoring layer. It alerts you that spend crossed a threshold. It does not stop the call that crosses it.

The comparison

Pricing below is approximate and changes often; verify current pricing before deciding. "Enforcement" means the tool can block an API call before it fires based on budget, not merely alert or rate-limit after.

ToolStarting pricePre-flight enforcementArchitectureSelf-hostingPrimary job
LangSmithFree tier, then ~low tens $/seat/moNoAsync SDK + OTelEnterprise onlyTracing + evals
HeliconeFree tier, then usage-basedNo (rate limiting only)Proxy gatewayYes (open source)Logging + caching
LangfuseFree tier, then usage-basedNoAsync SDKYes (open source)Tracing + prompt mgmt
PortkeyFree tier, then usage-basedPartial (per-key spend cap)Proxy gatewayLimitedRouting + gateway
Arize PhoenixFree (open source)NoOTel-nativeYes (open source)Eval + RAG debugging
Datadog LLM ObsAdd-on, per-span/sessionNoOTel + agentNoPlatform consolidation
noburn.devFree 50k req/mo, then $9/moYes (client-side, pre-call)Pre-flight SDKCloudCost enforcement

What the category is still missing

Every tool above answers the question "what did this cost?" Almost none of them answer "should I let this call happen at all?" before the call goes out. That is the enforcement gap, and two years of maturation has not closed it because the dominant architectures are structurally incapable of it.

Async SDKs (LangSmith, Langfuse, Phoenix) log after the response returns. By the time they have the cost, the money is spent. Proxy gateways (Helicone, Portkey) are in the path and can reject requests, which is why Portkey's per-key spend cap is the closest the mainstream comes, but their controls operate at the request-rate or virtual-key level, and they require you to route production traffic through an external hop. None of them does the specific thing a cost-control layer actually needs: estimate the token cost of the exact call you are about to make, check it against the remaining budget for that specific user or project, and refuse it client-side before a single token leaves your process.

The gap is sharpest in multi-tenant SaaS. If you resell LLM features to end customers, you do not have one budget, you have one per customer, and you need to stop customer A from spending customer B's allocation in real time. Reporting-grade per-user attribution, which several tools do well, tells you who spent what last week. It does not stop the spend this second. The category has converged on excellent hindsight and left foresight almost entirely unbuilt.

FAQ

Is LLM observability the same as enforcement? No. Observability records and reports what already happened: traces, token counts, cost, eval scores. Enforcement decides whether a call is allowed to happen before it fires. Every major observability tool in 2026 does the first job well, and only a couple do any version of the second, none of them as a pre-flight per-call cost check.

Can't I just set spend alerts and call it enforcement? Alerts notify you after a threshold is crossed, which means the overage has already been billed by the provider. By the time a Slack alert fires that a user hit their cap, the runaway agent has often already made dozens more calls. Alerting is a smoke detector, not a circuit breaker.

Does routing through a proxy gateway solve the enforcement problem? Partly. A gateway like Portkey can reject a request once a virtual key's spend cap is hit, which is genuine enforcement. The limits are that it caps at the key level rather than arbitrary end-user identity, it is a hard cap rather than an estimate of the specific call, and it requires routing all production traffic through an external hop you now depend on for availability.

Do I still need an observability tool if I have enforcement? Usually yes, because they answer different questions. Observability tells you why an agent behaved a certain way and whether output quality regressed; enforcement keeps a budget from being exceeded. Many teams run a tracing tool for debugging and evals alongside a pre-flight layer for cost control, since neither replaces the other.

Which tool should I self-host if I can't send prompts to a vendor? Langfuse and Helicone are the two mainstream open-source options with mature self-host paths, and Arize Phoenix if your priority is eval and RAG debugging on OpenTelemetry. All three are observability tools, so you will still have the enforcement gap to solve separately.

The enforcement gap noburn fills

noburn.dev does the one thing every tool in this article leaves unbuilt: it estimates the token cost of a request client-side, before the API call fires, and blocks the request when the user or project is already over budget. The check happens inside your process, so no token leaves and no overage is billed, rather than being recorded after the response returns or capped at the gateway level. It integrates with OpenAI, Anthropic, LiteLLM, LangChain, LangGraph, and the Vercel AI SDK, so it drops into the same stack the tools above instrument. For multi-tenant SaaS, per-user metering enforces a separate spending limit for each end customer in real time, and Stripe passthrough billing lets you charge those customers for their own usage. The free tier covers 50,000 requests per month. Documentation and SDKs are at noburn.dev/docs.