LLM observability is the practice of logging, tracing, and analyzing what your language model calls are doing in production. Every serious AI product needs it. The category has five viable tools in 2026, two of which are in the middle of acquisition transitions that meaningfully affect their roadmaps.
This post covers what each tool actually tracks, what each one costs, and where the entire category shares a structural blind spot that no amount of logging solves.
What does LLM observability actually mean?
LLM observability means capturing the inputs, outputs, latency, token counts, costs, and errors of every model call your application makes — and making that data queryable after the fact. A well-instrumented LLM application tells you: which model was called, with what prompt, by which user, at what cost, with what latency, and whether the response was what you expected.
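As a rough illustration of the fields listed above, here is a minimal sketch of a per-call record and a logging wrapper. Everything in it is an assumption for illustration: the `LLMCallRecord` fields, the `PRICES` table, the whitespace-based `count_tokens` stand-in, and the `call_model` callable are hypothetical and do not reflect any specific tool's schema.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LLMCallRecord:
    """One logged model call; field names are illustrative."""
    model: str
    user_id: str
    prompt: str
    response: Optional[str]
    input_tokens: int
    output_tokens: int
    cost_usd: float
    latency_ms: float
    error: Optional[str] = None


# Hypothetical price table in USD per 1M tokens: (input, output).
PRICES = {"example-model": (1.00, 3.00)}

LOG: list[LLMCallRecord] = []


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def logged_call(model: str, user_id: str, prompt: str,
                call_model: Callable[[str], str]) -> Optional[str]:
    """Run call_model(prompt) and append a record of what happened."""
    start = time.perf_counter()
    response, error = None, None
    try:
        response = call_model(prompt)
    except Exception as exc:  # record the failure instead of losing it
        error = repr(exc)
    latency_ms = (time.perf_counter() - start) * 1000
    in_tok = count_tokens(prompt)
    out_tok = count_tokens(response) if response else 0
    in_price, out_price = PRICES[model]
    cost = (in_tok * in_price + out_tok * out_price) / 1_000_000
    LOG.append(LLMCallRecord(model, user_id, prompt, response,
                             in_tok, out_tok, cost, latency_ms, error))
    return response


def cost_by_user(records: list[LLMCallRecord]) -> dict[str, float]:
    """The 'queryable after the fact' part: total spend per user."""
    totals: dict[str, float] = {}
    for r in records:
        totals[r.user_id] = totals.get(r.user_id, 0.0) + r.cost_usd
    return totals
```

Note that the record is written only after the call returns or fails, which is the defining property of this layer: it describes calls that have already happened.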
The practice borrowed the word "observability" from infrastructure monitoring (Datadog, Grafana), but the underlying data is different. Infrastructure observability captures metrics and traces across services. LLM observability also captures the semantic content of AI interactions, the prompts and responses themselves, which has no equivalent in traditional application performance monitoring (APM).
The category has split into two segments: proxy-based tools (Helicone, Portkey) that sit in the request path and log everything automatically, and SDK-based tools (LangSmith, Langfuse, Arize Phoenix) that are instrumented in your application code, which usually takes more integration work but supports richer tracing across multi-step chains and agents.
Which LLM observability tools are worth using in 2026?
The five tools with meaningful adoption are Helicone, Portkey, LangSmith, Langfuse, and Arize Phoenix. Two are in acquisition transitions. Three are actively developed. Here is the breakdown.
Is Helicone still being actively developed?
No. As of March 3, 2026, Helicone is in maintenance mode following its acquisition by Mintlify. Security updates, new model pricing, and bug fixes continue. New feature development has stopped.
Helicone is the fastest way to get LLM visibility in 2026 — and the worst bet for anything you need to still be running in 2027. Change one baseURL and you get full LLM logging in under two minutes. No SDK, no manual instrumentation, no schema changes. For teams that need immediate cost visibility on a simple single-provider product, nothing is faster.
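The "change one baseURL" idea can be sketched generically. The snippet below is not Helicone's API: `ClientConfig`, `via_proxy`, the placeholder proxy URL, and the `X-Proxy-Auth` header name are all hypothetical. Consult the proxy vendor's documentation for the real endpoint and authentication header.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ClientConfig:
    """Minimal stand-in for an LLM client's connection settings."""
    base_url: str
    headers: dict = field(default_factory=dict)


def via_proxy(cfg: ClientConfig, proxy_url: str, proxy_key: str) -> ClientConfig:
    """Return a copy of cfg that sends traffic through a logging proxy.

    Only the base URL changes and one auth header is added (header name is
    hypothetical); the provider credentials and request code stay as they were.
    """
    headers = {**cfg.headers, "X-Proxy-Auth": f"Bearer {proxy_key}"}
    return replace(cfg, base_url=proxy_url, headers=headers)


direct = ClientConfig(
    base_url="https://api.provider.example/v1",       # placeholder provider URL
    headers={"Authorization": "Bearer sk-test"},
)
proxied = via_proxy(direct, "https://llm-proxy.example.com/v1", "pk-test")
```

Because the proxy sees every request and response in transit, no further instrumentation is needed. The same property means the proxy is now in your request path, so its availability becomes part of yours.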
The limitations that existed before the acquisition are now permanent:
- No hard budget caps — Helicone shows you what you spent, it does not stop you from spending it
- Rate limiting is by request count, not by dollar amount
- The $720/month pricing cliff between Pro ($79/mo) and Team ($799/mo) has no middle tier and will not get one
- Sessions-level tracing exists but is shallower than SDK-based alternatives for complex agent workflows
Mintlify builds documentation tools. There is no announced roadmap for Helicone under Mintlify ownership. Treating Helicone as a long-term production dependency is a planning mistake.
Best for: fast visibility on new projects where you need logs today and can migrate later.
What does Portkey do that Helicone doesn't?
Portkey is a full gateway stack — routing, fallbacks, load balancing, semantic caching, guardrails, and observability in one product. It covers substantially more than Helicone's logging-first approach.
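To make "fallbacks" concrete, here is a minimal sketch of what a gateway does when the primary provider fails. It is an assumption-laden illustration rather than Portkey's implementation: `call_with_fallback`, `AllProvidersFailed`, and the provider callables are hypothetical names.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has errored."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            # Remember why this provider failed, then move on to the next one.
            errors.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(errors))
```

A production gateway would add timeouts, retry limits, and per-provider health tracking on top of this. Note that each fallback attempt is a billable call, which is one reason routing and cost control end up coupled.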
On April 30, 2026, Palo Alto Networks announced its intent to acquire Portkey, with the deal expected to close in Q4 FY2026. Unlike Helicone's move to maintenance mode, Portkey's acquisition is positioned as an acceleration: Palo Alto is buying the AI gateway capabilities for its enterprise security stack.
Palo Alto is a $100B+ enterprise security vendor. When a company at that scale acquires a developer tool, the outcome is predictable: pricing moves upmarket, enterprise features become the roadmap, and self-serve tiers receive less investment. I would not build a new product on Portkey's $49/mo tier today without a clear plan for what happens when the post-close pricing changes.
The enforcement limitation is unchanged by the acquisition: budget caps are an Enterprise-only feature. The Production tier at $49/mo gives you routing and observability. It does not block calls when a user's budget is exhausted. That capability requires Enterprise pricing — custom, sales-gated, likely $500+/mo. For a technical deep-dive on why this gap matters, see How to Set a Hard Budget Cap on LLM API Calls.
Best for: teams that need multi-provider routing, fallbacks, and guardrails at the gateway layer, and who are comfortable with the Palo Alto acquisition risk. Poor fit for founders who need cost enforcement without an enterprise contract.
Is LangSmith a good choice for LLM observability?
Yes, if you are using LangChain or LangGraph. LangSmith is the native tracing and evaluation platform for the LangChain ecosystem. It instruments chains, agents, and tools automatically when you use LangChain's SDK — there is no separate integration step.
Outside the LangChain ecosystem, LangSmith requires more manual instrumentation than alternatives. It also does not include cost enforcement at any tier — LangSmith's own comparison documentation acknowledges it lacks cost-saving features beyond basic cost tracking.
Pricing is $39/seat/month for the Plus tier (10k base traces, then $2.50/1k traces). For teams with many developers and high trace volume, this competes poorly against usage-based alternatives.
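The pricing above can be turned into a quick estimate. This sketch assumes the 10k base traces are pooled across the workspace rather than granted per seat, and that overage is billed pro rata; both are assumptions, so check the current pricing page before relying on the numbers. The function name is hypothetical.

```python
def langsmith_plus_monthly_usd(
    seats: int,
    traces: int,
    seat_price: float = 39.0,       # USD per seat per month (from the text above)
    included_traces: int = 10_000,  # assumed to be pooled, not per seat
    overage_per_1k: float = 2.50,   # USD per 1k traces beyond the included amount
) -> float:
    """Estimate a monthly Plus-tier bill from seat count and trace volume."""
    overage = max(0, traces - included_traces)
    return seats * seat_price + (overage / 1_000) * overage_per_1k


# Five developers, 500k traces in a month:
# 5 * 39 = 195 in seats, plus 490k overage traces * 2.50 per 1k = 1225.
estimate = langsmith_plus_monthly_usd(seats=5, traces=500_000)  # 1420.0
```

At this volume the trace overage is more than six times the seat cost, which is where per-trace pricing starts to matter more than the headline per-seat figure.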
Best for: teams building with LangChain or LangGraph who want zero-friction tracing within that ecosystem. Poor fit for teams that have outgrown LangChain or need cost enforcement.
What is Langfuse and how does it compare?
Langfuse is an open-source LLM engineering platform — self-host or managed cloud — that covers tracing, evals, prompt management, and cost tracking. It is SDK-first rather than proxy-first, which means deeper tracing at the cost of more integration work.
Langfuse is what Helicone would be if Helicone were still being built. The cloud version charges $8 per 100k units on paid tiers, with a generous free tier that covers most early-stage usage. Self-hosting is MIT-licensed with no feature gates on core tracing functionality — you get the full product without a commercial agreement.
The key differentiator over Helicone and LangSmith: Langfuse supports parent-child span relationships across multi-step chains and agent workflows. You can trace an entire agent run as a single observable unit — with subtrace visibility into individual tool calls, retrieval steps, and model calls — rather than a collection of disconnected requests. At $8 per 100k units, it is also substantially cheaper than LangSmith at comparable trace volumes.
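Parent-child span relationships are easier to see in code. The following is a minimal sketch of the data structure, not the Langfuse SDK: `Span`, `Tracer`, and `render` are hypothetical names, and a real tracer would also record timings, inputs, outputs, and costs on each span.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass(eq=False)
class Span:
    """One step in a trace; children are the steps it triggered."""
    name: str
    parent: Optional["Span"] = None
    children: list["Span"] = field(default_factory=list)


class Tracer:
    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        """Open a span nested under whichever span is currently open."""
        parent = self._stack[-1] if self._stack else None
        s = Span(name, parent)
        (parent.children if parent else self.roots).append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            self._stack.pop()


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten a span tree into indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("agent_run"):
    with tracer.span("retrieval"):
        pass
    with tracer.span("model_call"):
        with tracer.span("tool:search"):
            pass

trace_lines = render(tracer.roots[0])
# ["agent_run", "  retrieval", "  model_call", "    tool:search"]
```

A proxy sees only the individual HTTP requests, so it has no way to know that `tool:search` was triggered by `model_call` inside `agent_run`. That nesting has to be declared in application code, which is the integration cost SDK-based tools ask for.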
Like every tool in this list outside Portkey's Enterprise tier, Langfuse does not enforce budget caps. It tracks what was spent after the fact.
Best for: teams that want deep agent tracing, active open-source development, and a self-hosting option that does not require a commercial license.
When should you use Arize Phoenix for LLM observability?
When OpenTelemetry portability is non-negotiable or you need a self-hosted solution with no feature gates. Arize Phoenix is the open-source tier of Arize's platform — fully self-hostable, Apache 2.0 licensed, and built on the OpenInference schema for OTel compatibility.
In practical terms, Phoenix supports agent tracing graphs, trajectory mapping that catches recursive loops, and dual-level evaluation covering both individual tool calls and session-level goal completion. These features are available in the self-hosted version without a commercial license.
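To show what "catching a recursive loop" in a trajectory can mean, here is a simple sketch that checks whether a run ends in the same block of steps repeated several times. This is an illustrative heuristic only and is not how Phoenix implements trajectory analysis; `find_repeating_cycle` and its parameters are hypothetical.

```python
from typing import Hashable, Optional, Sequence


def find_repeating_cycle(
    steps: Sequence[Hashable],
    min_repeats: int = 3,
    max_cycle_len: int = 5,
) -> Optional[tuple]:
    """Return the repeated block if the trajectory ends with it min_repeats times in a row.

    Tries the shortest cycle length first; returns None when no loop is found.
    """
    n = len(steps)
    for cycle_len in range(1, max_cycle_len + 1):
        needed = cycle_len * min_repeats
        if needed > n:
            break  # not enough history for this or any longer cycle
        tail = list(steps[n - needed:])
        cycle = tail[:cycle_len]
        if all(tail[i] == cycle[i % cycle_len] for i in range(needed)):
            return tuple(cycle)
    return None


# An agent alternating between the same search and the same read, three times over.
trajectory = [("search", "q"), ("read", "doc1")] * 3
loop = find_repeating_cycle(trajectory)
```

Each step here is a `(tool, argument)` pair, so an agent that calls the same tool with different arguments is not flagged. Detection of this kind is still after-the-fact unless the agent loop itself consults it between steps.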
The commercial tier, Arize AX, adds online evaluations, production monitoring, and the Alyx AI assistant. AX Pro is $50/mo for 50,000 spans with $10 per million additional spans. The 50k span cap is restrictive for agent workflows — a coding agent making 20 tool calls per task exhausts the quota in approximately 2,500 runs.
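The quota arithmetic is worth making explicit. This sketch assumes one span per tool call and pro rata overage billing; real agent runs typically emit additional spans for model calls and retrieval, so the quota may be reached sooner. Function names are hypothetical.

```python
def runs_until_quota(included_spans: int, spans_per_run: int) -> int:
    """Number of complete runs that fit inside the included span allowance."""
    return included_spans // spans_per_run


def ax_pro_monthly_usd(
    spans: int,
    base_usd: float = 50.0,             # AX Pro base price (from the text above)
    included_spans: int = 50_000,
    overage_per_million: float = 10.0,  # USD per additional 1M spans
) -> float:
    """Estimate a monthly bill, assuming overage is billed pro rata."""
    overage = max(0, spans - included_spans)
    return base_usd + overage / 1_000_000 * overage_per_million


runs = runs_until_quota(50_000, 20)  # 2500
bill = ax_pro_monthly_usd(1_050_000)  # 60.0: base plus one million extra spans
```

Under these assumptions the overage rate is low relative to the base price, so the 50k cap is more of an early threshold than a hard ceiling on cost.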
Best for: teams with strict self-hosting requirements, OTel-native stacks, or who need deep agent tracing without a vendor dependency. The commercial tier is better evaluated against Langfuse than against Helicone or Portkey.
LLM observability tools compared
| Tool | Type | Pricing | Budget caps | Agent tracing | Status |
|---|---|---|---|---|---|
| Helicone | Proxy | Free / $79 / $799/mo | ✗ Never | Sessions only | Maintenance mode (Mintlify) |
| Portkey | Gateway | Free / $49 / Enterprise | Enterprise only | ✓ | Acquisition pending (Palo Alto) |
| LangSmith | SDK | Free / $39/seat/mo | ✗ Never | ✓ (LangChain-native) | Active |
| Langfuse | SDK | Free / usage-based / self-host | ✗ Never | ✓ | Active, open source |
| Arize Phoenix | SDK | Free (OSS) / $50/mo (AX Pro) | ✗ Never | ✓ | Active, open source |
Every tool in this table logs what happened. Outside Portkey's Enterprise tier, none of them blocks what is about to happen.
Does LLM observability include cost enforcement?
No. This is the category's structural limitation, and it is by design. Observability tools capture data about calls that have already been sent. By the time a token count is logged, the request has been processed and the cost is on your invoice.
For deterministic workloads — a chatbot with a fixed system prompt and predictable user inputs — this is acceptable. Cost drift is visible in the dashboard within hours and correctable manually.
For agent workloads, the gap is not acceptable. An agent that retries on failure at 3am, processes a 500k-token document because the user uploaded the wrong file, or spawns parallel subagents can exhaust a month's budget in twenty minutes. The observability dashboard will show you the damage clearly. It will not have prevented any of it.
The mechanism that prevents this kind of overrun is pre-flight enforcement: estimating the cost of each call before it fires and rejecting calls that would exceed the remaining budget. This is a different architectural layer from observability: it sits upstream of the model call, not downstream. For teams that need both observability and enforcement, the enforcement layer requires a separate tool. See Per-User LLM Billing: The Gap Nobody Has Filled for the full picture on what that combination requires.
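A minimal sketch of a pre-flight check, under stated assumptions: the four-characters-per-token heuristic, the prices, and the names `BudgetGuard`, `BudgetExceeded`, and `call_model` are all hypothetical, and this is not the implementation of any tool mentioned in this post.

```python
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(Exception):
    """Raised before the model call when the estimate would overrun the budget."""


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); a real tokenizer is more accurate."""
    return max(1, len(text) // 4)


@dataclass
class BudgetGuard:
    remaining_usd: float
    input_usd_per_mtok: float   # hypothetical price, USD per 1M input tokens
    output_usd_per_mtok: float  # hypothetical price, USD per 1M output tokens

    def estimate_cost(self, prompt: str, max_output_tokens: int) -> float:
        """Worst-case cost: estimated input plus the full output allowance."""
        return (estimate_tokens(prompt) * self.input_usd_per_mtok
                + max_output_tokens * self.output_usd_per_mtok) / 1_000_000

    def guarded_call(self, prompt: str, max_output_tokens: int,
                     call_model: Callable[[str], str]) -> str:
        """Reject the call up front if its worst case does not fit the budget."""
        estimate = self.estimate_cost(prompt, max_output_tokens)
        if estimate > self.remaining_usd:
            raise BudgetExceeded(
                f"estimated ${estimate:.4f} exceeds remaining ${self.remaining_usd:.4f}"
            )
        self.remaining_usd -= estimate  # reserve the worst case before calling
        return call_model(prompt)
```

The guard reserves the worst-case cost rather than the actual cost, which errs on the side of blocking too early. A fuller version would refund the difference once the real token counts come back, and would make the check-and-reserve step atomic so that parallel subagents cannot both pass the check against the same remaining balance.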
Frequently asked questions
What is the difference between LLM observability and LLM monitoring?
The terms are used interchangeably in most contexts. "Monitoring" typically refers to threshold-based alerting on metrics (latency > 2s, error rate > 5%). "Observability" refers to the broader ability to ask arbitrary questions about system state from the outside — including cost attribution, prompt inspection, and trace analysis. All five tools in this post cover both, but their depth varies significantly on the tracing side.
Which LLM observability tool is easiest to set up?
Helicone is the fastest to set up — one baseURL change and logging starts immediately, no SDK required. Portkey is similarly fast for proxy-based logging. Langfuse and Arize Phoenix require SDK integration but provide richer tracing in exchange. LangSmith requires the least integration if you are already using LangChain.
Can I use LLM observability to enforce a budget cap?
No. Observability tools log spend after calls fire. Budget enforcement requires blocking calls before they fire, which is a pre-flight mechanism. Tools like noburn.dev provide pre-flight enforcement as a separate layer that works alongside observability tooling.
Is Langfuse better than Helicone?
For teams that need deep multi-step tracing and long-term reliability, yes. Helicone is simpler to set up but is now in maintenance mode under Mintlify with no active feature development. Langfuse is open source, actively developed, and provides substantially deeper agent tracing through SDK-level instrumentation. The trade-off is integration complexity: Langfuse requires more code than Helicone's proxy approach.
What happens to Helicone now that Mintlify acquired it?
Helicone's own announcement states the platform is in maintenance mode — security updates and bug fixes continue, but new feature development has stopped. Mintlify is a documentation tool company; the Helicone product direction under Mintlify ownership has not been publicly detailed. Teams evaluating Helicone for new projects should treat maintenance mode as a concrete long-term risk.