Once an agent ships to production, the question stops being "does it work" and becomes "what did it just cost, and can I see why." A single agent run can fan out into dozens of LLM calls across tool loops, retries, and sub-agents. Without tracing, a $4,000 monthly bill is an opaque number. With tracing, it is a span tree you can actually debug.
LangSmith and Langfuse are the two tools most production agent teams shortlist for that job. They overlap heavily on the core feature, request tracing, but they diverge sharply on licensing, self-hosting, and pricing model. This comparison covers what each one does, what it costs as of early 2026, where the limits are, and the one thing neither of them does: stop a call before it spends money.
LangSmith: the LangChain-native tracer
LangSmith is built by the LangChain team. If your stack already runs on LangChain or LangGraph, instrumentation is close to free: set three environment variables (LANGCHAIN_TRACING_V2, LANGCHAIN_API_KEY, LANGCHAIN_PROJECT) and traces start flowing without code changes. It also has a standalone SDK and an OpenTelemetry endpoint, so you can use it without LangChain, but the tightest integration is with LangChain's own runtimes.
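When it is easier to set those variables in code than in the shell, a minimal Python sketch looks like the following. The three variable names come from the paragraph above. The key value, the project name, and the helper `tracing_env` are illustrative placeholders, not part of any SDK.

```python
import os


def tracing_env(api_key: str, project: str) -> dict[str, str]:
    """Build the three variables named above (hypothetical helper, not an SDK call)."""
    return {
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_API_KEY": api_key,
        "LANGCHAIN_PROJECT": project,
    }


# Placeholder values: substitute your own key and project name.
# Set these early in process startup, before the agent code runs.
os.environ.update(tracing_env(api_key="<your-api-key>", project="my-agent-prod"))
```

In most deployments the same three values would live in the shell, a .env file, or the platform's secret store rather than in source code.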
Beyond tracing, LangSmith covers prompt versioning, datasets, offline evaluations, and LLM-as-judge scoring. The evaluation tooling is its strongest differentiator. You can pin a dataset, run a prompt change against it, and diff the scores before promoting to production.
Pricing as of early 2026 (check current pricing before budgeting): the Developer plan is free for one seat and includes a monthly allotment of base traces, then bills per additional trace. The Plus plan is roughly $39 per seat per month with a larger included trace volume and the same per-trace overage. Enterprise is custom-priced and is the only tier that unlocks self-hosting inside your own VPC.
Key limitations:
- The seat fee scales with your team, not your usage. A five-engineer team on Plus pays for five seats whether or not all five look at traces.
- Self-hosting is Enterprise-only. If data residency or air-gapping is a hard requirement, you are in a sales conversation, not a signup form.
- Trace overages are usage-based. The meter counts traces, so high-volume agents can run through the included allotment faster than teams expect.
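The seat-plus-overage shape described in the list above can be turned into a rough budgeting sketch. The roughly $39 seat price is the figure quoted earlier; the included trace allotment and the overage rate below are made-up placeholders, and treating the allotment as shared across the whole team is an assumption. Replace all of them with numbers from the current pricing page.

```python
def seat_plus_trace_cost(
    seats: int,
    monthly_traces: int,
    seat_price_usd: float,
    included_traces: int,
    overage_usd_per_1k: float,
) -> float:
    """Monthly cost under a per-seat fee plus per-trace overage (illustrative model)."""
    # Only traces beyond the included allotment are billed as overage.
    overage_traces = max(0, monthly_traces - included_traces)
    return seats * seat_price_usd + overage_traces / 1000 * overage_usd_per_1k


# Hypothetical inputs: 5 seats, 250k traces, 10k included, $0.50 per 1k overage.
estimate = seat_plus_trace_cost(
    seats=5,
    monthly_traces=250_000,
    seat_price_usd=39.0,
    included_traces=10_000,   # placeholder, not a quoted figure
    overage_usd_per_1k=0.50,  # placeholder, not a quoted figure
)
print(f"Estimated monthly cost: ${estimate:,.2f}")
```

Running the same function with seats held constant and traces varied shows which of the two terms dominates for a given team.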
LangSmith is not an acquired or spun-off product. It is LangChain's own commercial offering, so its roadmap tracks the LangChain and LangGraph ecosystem directly. That is an advantage if you live in that ecosystem and a coupling risk if you do not.
Langfuse: the open-source alternative
Langfuse is an open-source LLM engineering platform. The core is MIT-licensed and self-hostable with a docker compose up, which is the reason most teams evaluate it. You can run the full tracing stack on your own infrastructure with no feature gate on the observability core, and your trace data never leaves your network.
Functionally it covers the same ground as LangSmith: tracing, prompt management, evaluations, datasets, and scoring. It is framework-agnostic by design, with SDKs for Python and JavaScript, native OpenTelemetry support, and integrations for OpenAI, Anthropic, LiteLLM, LangChain, and the Vercel AI SDK. If your stack is heterogeneous, Langfuse tends to fit with less friction than a LangChain-native tool.
Pricing as of early 2026 (check current pricing): self-hosting the open-source core is free, with some advanced features (such as fine-grained RBAC and SSO enforcement) gated behind a paid self-hosted tier. Langfuse Cloud offers a free Hobby tier with a monthly event allotment, a Core plan around $29/mo, and a Pro plan around $199/mo, billed primarily on event volume rather than per seat.
Key limitations:
- Self-hosting is real work. "Free to self-host" still means you run Postgres, ClickHouse, Redis, and the app, plus upgrades and backups. The license is free; the operational cost is not zero.
- Some enterprise controls are paid. SSO enforcement and granular RBAC sit behind the commercial tiers even when you self-host.
- Event-based metering needs forecasting. High-cardinality agent traces generate many events, and the Cloud tiers meter on that volume.
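A back-of-the-envelope forecast for event-based metering might look like the sketch below. The run counts, events per run, and the 100,000-event allotment are hypothetical inputs, and what counts as a billable event is defined by the vendor, so check that definition before trusting the numbers.

```python
def monthly_events(runs_per_day: int, events_per_run: int, days: int = 30) -> int:
    """Forecast event volume from agent run count and average events per run."""
    return runs_per_day * events_per_run * days


def max_runs_per_day(allotment: int, events_per_run: int, days: int = 30) -> int:
    """Largest daily run count that stays inside a monthly event allotment."""
    return allotment // (events_per_run * days)


# Hypothetical workload: 2,000 agent runs a day, 40 events per run.
forecast = monthly_events(runs_per_day=2_000, events_per_run=40)

# Hypothetical allotment of 100,000 events per month.
headroom = max_runs_per_day(allotment=100_000, events_per_run=40)

print(f"Forecast: {forecast:,} events/month; allotment covers {headroom} runs/day")
```

Measuring events per run on a handful of real traces is usually the step that changes the forecast most, since agent loops vary widely in depth.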
Langfuse is an independent company (YC W23, venture-backed) and has not been acquired. For teams that treat vendor independence and an open core as procurement requirements, that status is part of the appeal.
Side-by-side comparison
| Dimension | LangSmith | Langfuse | noburn.dev |
|---|---|---|---|
| Primary function | Tracing + evals | Tracing + evals | Pre-flight cost enforcement |
| When it acts | After the call (logs) | After the call (logs) | Before the call (blocks) |
| Enforcement model | None (observe only) | None (observe only) | Hard block when over budget |
| Self-hosting | Enterprise tier only | Open-source core, free | Managed (SaaS) |
| Pricing basis | Per seat + per trace | Per event (Cloud) / free self-host | Per request, flat tiers |
| Free tier | 1 seat, limited traces | Hobby tier, limited events | 50k req/mo, 1 project |
| Paid entry | ~$39/seat/mo (Plus) | ~$29/mo (Core) | $9/mo (500k req, locked for first 50 users) |
| Per-user metering | No | No | Yes (per end-customer) |
| Ownership | LangChain | Independent (YC W23) | Independent |
Pricing figures are as of early 2026. Confirm current numbers on each vendor's pricing page before you commit a budget.
The enforcement gap
Here is the structural problem both tools share: they are observability platforms, and observability is a read path. A trace is written after the model call returns. By the time LangSmith or Langfuse shows you that a run cost $12, the $12 is already spent. You can alert on it, chart it, and build a dashboard that turns red, but the dashboard turning red does not stop the next call from firing.
For an internal analytics tool that nobody runs at 3 a.m., that lag is fine. You review spend weekly and move on. For a production multi-tenant SaaS where end-customers trigger agent runs, the lag is the whole problem. A single customer with a runaway loop, a prompt injection that talks your agent into a 50-step tool spiral, or a free-tier user hammering an expensive model can run up real money in the minutes before any alert fires, let alone before a human reacts to it.
What teams actually need at that layer is a gate, not a chart. The call should be checked against a budget before it executes, and rejected if the user or project is already over the line. That is pre-flight enforcement, and it is a different architectural position from observability. Observability sits beside the call and records it. Enforcement sits in front of the call and can refuse it.
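The difference in position can be sketched in a few lines of Python. This is an illustrative in-memory gate, not any vendor's implementation: the class and function names are hypothetical, the cost estimate is supplied by the caller, and a real deployment would need shared storage, locking, and reconciliation against actual billed cost.

```python
from collections import defaultdict
from typing import Callable, TypeVar

T = TypeVar("T")


class BudgetExceeded(Exception):
    """Raised instead of running a call that would cross the budget."""


class BudgetGate:
    """Hypothetical per-user gate that sits in front of a model call."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd: dict[str, float] = defaultdict(float)

    def call(self, user_id: str, estimated_usd: float, fn: Callable[[], T]) -> T:
        # Pre-flight check: refuse before fn runs, so no money is spent.
        if self.spent_usd[user_id] + estimated_usd > self.limit_usd:
            raise BudgetExceeded(f"{user_id} would exceed ${self.limit_usd:.2f}")
        result = fn()
        # Record the estimate only after the call went through.
        self.spent_usd[user_id] += estimated_usd
        return result


calls_made: list[str] = []


def fake_model_call() -> str:
    """Stand-in for a real SDK call."""
    calls_made.append("call")
    return "ok"


gate = BudgetGate(limit_usd=1.00)
gate.call("customer-42", 0.60, fake_model_call)  # allowed

try:
    gate.call("customer-42", 0.60, fake_model_call)  # would reach $1.20
    blocked = False
except BudgetExceeded:
    blocked = True  # the second call never executed
```

The point of the sketch is the ordering: the check runs before `fn()`, whereas a tracer can only record the cost once `fn()` has returned.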
Neither LangSmith nor Langfuse occupies that position, and that is not a defect in either product. It is a different job. You can wire alerts from either tool into a kill switch you build and maintain yourself, but you are then writing the budget logic, the per-user accounting, and the blocking middleware by hand, and racing the alert latency every time.
FAQ
Is LangSmith or Langfuse better for a LangChain/LangGraph stack? LangSmith has the tightest LangChain and LangGraph integration because the same team builds all three; tracing turns on with environment variables and no code changes. Langfuse supports LangChain through a callback handler and works fine, but the zero-config path belongs to LangSmith.
Can I self-host either one for free? Langfuse's open-source core is free to self-host under an MIT license, with some enterprise controls like SSO enforcement gated behind a paid tier. LangSmith self-hosting is available only on its Enterprise plan, so there is no free self-hosted path.
Do LangSmith and Langfuse enforce budgets or block expensive calls? No. Both are observability tools that record cost after a call completes. They can trigger alerts on spend, but neither blocks an API call before it fires, so they cannot prevent a runaway agent from spending money in real time.
How do I cap per-customer LLM spend in a multi-tenant app? Neither LangSmith nor Langfuse meters or enforces limits per end-customer out of the box; both track usage at the project or trace level. To enforce a separate budget per customer you need a layer that checks each request against that customer's limit before the call, which is what pre-flight enforcement tools do.
Which one is cheaper at scale? It depends on your usage shape. LangSmith bills per seat plus per trace, so cost rises with team size and trace volume. Langfuse Cloud bills primarily on event volume, and self-hosting trades the subscription for infrastructure and operational overhead. Model both against your actual trace and seat counts rather than the headline price.
Where noburn fits in this stack
LangSmith and Langfuse tell you what a call cost after it returned. noburn does the thing neither does: it estimates the token cost of a request client-side, before the API call fires, and blocks the request when the user or project is already over budget. It runs in front of your existing SDK calls for OpenAI, Anthropic, LiteLLM, LangChain, LangGraph, and the Vercel AI SDK, so you keep your tracing tool for visibility and add enforcement as a separate, complementary layer. For multi-tenant teams, noburn meters spend per end-customer and supports Stripe passthrough billing, so you can both cap and charge for each customer's LLM usage without writing the accounting yourself. The free tier covers 50,000 requests per month. Documentation and SDKs are at noburn.dev/docs.
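To make "estimating cost before the call" concrete, here is a generic sketch of a client-side upper-bound estimate. It is not noburn's implementation or API. It assumes the common rule of thumb of about four characters per token for English text, assumes the response uses its full output limit, and takes per-1,000-token prices as inputs you supply; a real tokenizer would give tighter numbers.

```python
import math


def estimate_cost_usd(
    prompt: str,
    max_output_tokens: int,
    usd_per_1k_input: float,
    usd_per_1k_output: float,
    chars_per_token: float = 4.0,  # rough heuristic, not a tokenizer
) -> float:
    """Upper-bound cost estimate computed before any API call is made."""
    input_tokens = math.ceil(len(prompt) / chars_per_token)
    input_cost = input_tokens / 1000 * usd_per_1k_input
    # Worst case: the model produces every token it is allowed to.
    output_cost = max_output_tokens / 1000 * usd_per_1k_output
    return input_cost + output_cost


# Hypothetical prices and a 4,000-character prompt.
worst_case = estimate_cost_usd(
    prompt="x" * 4_000,
    max_output_tokens=500,
    usd_per_1k_input=1.0,   # placeholder price
    usd_per_1k_output=2.0,  # placeholder price
)
print(f"Worst-case cost before calling: ${worst_case:.2f}")
```

An estimate like this is what a gate compares against the remaining budget; the actual billed cost can then be reconciled afterwards from the tracing data.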