Every team shipping production LLM features eventually hits the same debugging wall: you cannot debug what you cannot see. A multi-step agent fails in production, a RAG pipeline returns garbage, token costs triple overnight, and you have no trace of which call did what, which model changed, or which retry loop consumed the budget. That is the fundamental problem LLM observability tools solve, and the two platforms that come up most often in production discussions are Arize Phoenix and LangSmith.
They solve overlapping problems with opposite philosophies. Phoenix is open-source, OpenTelemetry-native, and runs wherever you want. LangSmith is fully managed, tightly coupled to the LangChain ecosystem, and runs on LangChain's infrastructure unless you pay for enterprise self-hosting. The hosting model is the obvious surface difference, but the deeper trade-offs affect data ownership, instrumentation lock-in, eval workflows, long-term pricing, and what happens to your traces if a vendor changes strategy. This comparison walks through both platforms in detail, covers when to pick each, and addresses the one critical gap neither solves: preventing a spend overrun before it happens.
How LLM observability fits into production operations
Before comparing specific tools, understand why observability matters for LLM applications. According to a 2025 O'Reilly survey, 73% of enterprises running LLM agents in production cited debugging and cost control as their top operational challenges. Traditional application monitoring (APM tools like Datadog or New Relic) measures latency and errors, but it does not understand token counts, model selection, retry behavior, or the cost implications of a prompt template change.
LLM observability fills this gap by capturing:
- Every LLM API call — prompt content, model, token counts (input and completion)
- Agent traces — the full tree of tool calls, retries, and reasoning steps
- Cost attribution — cost per user, per feature, per project
- Quality signals — latency, error rates, and (with some tools) output eval scores
Without this visibility, a 10x cost spike looks like a mystery. With observability, you can identify whether it was a single bad prompt, a retry storm, or a code deploy that changed the prompt template.
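To make the cost attribution idea concrete, here is a minimal sketch of how captured call records can be rolled up per user or per feature, and how the share of spend caused by retries can be measured. The `LLMCall` record, the `example-model` name, and the prices are all hypothetical; neither Phoenix nor LangSmith exposes this exact structure.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per 1,000 tokens; real prices vary by provider and model.
PRICE_PER_1K = {"example-model": {"input": 0.001, "output": 0.002}}


@dataclass
class LLMCall:
    """One captured LLM API call (illustrative schema, not a real tool's format)."""
    user_id: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int
    is_retry: bool = False


def call_cost(call: LLMCall) -> float:
    """Cost of one call in USD, computed from token counts and the price table."""
    price = PRICE_PER_1K[call.model]
    return (call.input_tokens * price["input"]
            + call.output_tokens * price["output"]) / 1000


def cost_by(calls, key):
    """Sum cost grouped by an attribute name such as 'user_id' or 'feature'."""
    totals = defaultdict(float)
    for call in calls:
        totals[getattr(call, key)] += call_cost(call)
    return dict(totals)


def retry_share(calls):
    """Fraction of total spend that came from retried calls."""
    total = sum(call_cost(c) for c in calls)
    retries = sum(call_cost(c) for c in calls if c.is_retry)
    return retries / total if total else 0.0


calls = [
    LLMCall("alice", "search", "example-model", 1000, 500),
    LLMCall("bob", "chat", "example-model", 2000, 1000),
    LLMCall("bob", "chat", "example-model", 2000, 1000, is_retry=True),
]
per_user = cost_by(calls, "user_id")
share = retry_share(calls)
```

A high `retry_share` points toward a retry storm, while a jump concentrated in one `feature` points toward a prompt or deploy change in that feature.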
Arize Phoenix: Open-source observability at scale
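Arize Phoenix is the open-source observability and evaluation platform from Arize AI, a San Francisco-based observability company founded in 2020. The project is licensed under the Elastic License 2.0, which permits free self-hosting with source code access but restricts offering the software to third parties as a hosted or managed service. Strictly speaking, that makes it source-available rather than OSI-approved open source, although the project is commonly described as open source and this article follows that usage.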
How Phoenix works
Phoenix instruments LLM applications through OpenInference, Arize's OpenTelemetry-based semantic convention for LLM traces. Because it uses the OpenTelemetry standard, instrumentation is not tied to any single framework. Arize maintains OpenInference instrumentors for:
- API providers: OpenAI, Anthropic, Cohere, Gemini, and others
- Frameworks: LangChain, LlamaIndex, DSPy, Vercel AI SDK
- Custom code: If you use a provider not yet instrumented, you can emit OpenTelemetry spans directly and Phoenix will read them
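For the custom-code path, the sketch below builds the attribute payload a hand-written LLM span might carry. The attribute keys follow the OpenInference semantic conventions as best understood here, so verify them against the current specification before relying on them. The helper name `llm_span_attributes` and the sample values are hypothetical.

```python
def llm_span_attributes(model, prompt, completion, prompt_tokens, completion_tokens):
    """Build OpenInference-style attributes for one LLM call.

    Assumption: key names match the OpenInference conventions; check the spec.
    """
    return {
        "openinference.span.kind": "LLM",
        "llm.model_name": model,
        "input.value": prompt,
        "output.value": completion,
        "llm.token_count.prompt": prompt_tokens,
        "llm.token_count.completion": completion_tokens,
        "llm.token_count.total": prompt_tokens + completion_tokens,
    }


attrs = llm_span_attributes(
    model="example-model",  # hypothetical model name
    prompt="Summarize this ticket.",
    completion="The customer cannot log in.",
    prompt_tokens=12,
    completion_tokens=8,
)
# With the OpenTelemetry SDK installed, a dictionary like this would be attached
# to an active span (for example via span.set_attributes(attrs)) and exported
# to the Phoenix collector endpoint.
```

Keeping the payload as a plain dictionary means the same function works regardless of which provider produced the completion.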
You can run Phoenix as:
- Python package: `pip install arize-phoenix` and run locally
- Docker container: Deploy to any Kubernetes cluster or VM you control
- Phoenix Cloud: Managed instance hosted by Arize (free tier available)
- Arize AX: The commercial platform that layers on RBAC, integrations, and advanced eval features
Strengths of Phoenix
True open-source. Self-hosting Phoenix requires no commercial license, no feature gatekeeping, and no seat-based pricing. If you have the infrastructure, it is free. This matters for teams with compliance constraints (finance, healthcare, legal) that cannot send trace data to a third-party cloud.
Framework agnostic. OpenTelemetry is an industry standard. If you switch from LangChain to DSPy to custom agents, the traces keep flowing because the underlying protocol is stable. This is the inverse of being locked into one framework's ecosystem.
Evaluation tools. Phoenix includes LLM-as-judge evaluators for hallucination, relevance, toxicity, and Q&A correctness out of the box. It also supports dataset-based evaluation: version a test set, run your prompt changes against it, and compare outputs side-by-side.
Data residency. Self-hosted Phoenix keeps all trace data on your infrastructure. No traces leave your network unless you choose to use Phoenix Cloud.
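The dataset-based evaluation workflow mentioned above (version a test set, run each prompt variant against it, compare scores) can be sketched in plain Python. This is not Phoenix's evaluation API. The `containment_judge` function is a deterministic stand-in for an LLM-as-judge evaluator, and the two `variant_*` functions are hypothetical placeholders for an application running under two prompt versions.

```python
def containment_judge(expected: str, actual: str) -> bool:
    """Stand-in for an LLM-as-judge: passes if the expected answer appears."""
    return expected.strip().lower() in actual.strip().lower()


def evaluate(app, dataset, judge=containment_judge) -> float:
    """Run the app over every row and return the fraction judged correct."""
    passed = sum(judge(row["expected"], app(row["question"])) for row in dataset)
    return passed / len(dataset)


# A small, versioned test set (illustrative rows).
dataset_v1 = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is 2 + 2?", "expected": "4"},
]


def variant_a(question: str) -> str:
    """Placeholder for the app under prompt version A (answers everything alike)."""
    return "The answer is Paris."


def variant_b(question: str) -> str:
    """Placeholder for the app under prompt version B."""
    answers = {"What is the capital of France?": "Paris", "What is 2 + 2?": "4"}
    return answers.get(question, "unknown")


scores = {
    "variant_a": evaluate(variant_a, dataset_v1),
    "variant_b": evaluate(variant_b, dataset_v1),
}
```

In a real setup the judge would itself be a model call, so scores would be noisier than this deterministic stand-in suggests.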
Limitations of Phoenix
Operational burden. Self-hosting means you own the database, retention policies, scaling, backups, and uptime. Phoenix does not handle multi-tenancy, RBAC, or SSO by default — you layer those on yourself. For a small team, this operational overhead might not be worth it.
Younger ecosystem. Phoenix is younger than LangSmith or commercial alternatives like Langfuse. The platform has fewer pre-built integrations, less institutional knowledge in ops teams, and a smaller community.
Collaboration features sit behind the paid tier. Deep collaboration features, team management, and advanced integrations live in the paid Arize AX platform, not in open-source Phoenix. If you outgrow the open-source tier, migration to AX requires a sales conversation and likely a higher price.
Pricing opacity. The boundary between free Phoenix Cloud and paid Arize AX shifts over time as Arize evolves its product, making it hard to predict long-term costs if you do not self-host.
LangSmith: Managed observability for the LangChain ecosystem
LangSmith is the managed observability and evaluation platform from LangChain, the company behind LangChain and LangGraph. If your application is built on LangChain or LangGraph, instrumentation is nearly automatic: set a few environment variables and traces start appearing.
How LangSmith works
LangSmith's primary design target is the LangChain execution model: chains, agents, tool calls, and retries. When you run an agent through LangChain's AgentExecutor, LangSmith automatically sees the entire tree: which tools were called, in what order, what inputs and outputs each produced, and how much time and how many tokens each step consumed. The trace view mirrors LangChain's internal state, so debugging is natural.
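The tree structure described above can be illustrated with a small data model: one top-level trace whose nested spans each report their own token usage, rolled up to the parent. This is a simplified sketch, not LangSmith's internal schema, and the `Span` class and span names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in a trace: an LLM call, a tool call, or a wrapper around both."""
    name: str
    tokens: int = 0  # tokens consumed by this step alone
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Tokens used by this span plus everything nested beneath it."""
        return self.tokens + sum(child.total_tokens() for child in self.children)

    def render(self, depth: int = 0) -> list[str]:
        """Indented lines, one per span, showing rolled-up token counts."""
        lines = [f"{'  ' * depth}{self.name} ({self.total_tokens()} tokens)"]
        for child in self.children:
            lines.extend(child.render(depth + 1))
        return lines


trace = Span("agent_run", children=[
    Span("llm_plan", tokens=300),
    Span("tool_search", children=[Span("llm_summarize", tokens=450)]),
    Span("llm_final", tokens=250),
])
print("\n".join(trace.render()))
```

Rolling totals up the tree is what lets a trace view show at a glance which branch of an agent run consumed the budget.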
LangSmith works outside LangChain too. You can wrap arbitrary functions with the @traceable decorator or use the LangSmith SDK directly, so a plain OpenAI or Anthropic application can send traces. The experience is best inside LangChain, but it is not a hard requirement.
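A minimal sketch of the decorator approach follows. `traceable` is imported from the `langsmith` package. The fallback no-op decorator is an addition made here so the example still runs where the package is not installed. The `summarize` function is a hypothetical placeholder for a real provider call. Whether a trace is actually sent depends on the tracing environment variables and API key being configured.

```python
try:
    from langsmith import traceable  # decorator from the LangSmith SDK
except ImportError:
    def traceable(func):
        """Fallback no-op so this sketch runs without the langsmith package."""
        return func


@traceable
def summarize(text: str) -> str:
    """Placeholder for a function that would call OpenAI, Anthropic, etc."""
    return text[:40]


result = summarize("LangSmith records the inputs and outputs of decorated functions.")
```

The decorated function behaves like the original from the caller's point of view, which is what makes this approach low-friction for code that does not use LangChain.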
Strengths of LangSmith
Tight LangChain integration. If you are already using LangChain or LangGraph, instrumentation is close to free. Set LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY=<your_key> and traces appear with zero code changes. This low barrier to entry is a huge advantage.
Prompt management. LangSmith includes a prompt hub and versioning system. You can version prompts, test changes against a dataset, and promote versions to production — all without redeploying your app.
Annotation and human feedback. You can flag traces for review, add human judgments, and use those annotations to train evals or refine prompts.
Polished UI. The trace view, eval dashboards, and monitoring dashboards are well-designed. The UX is notably better than many open-source alternatives.
Limitations of LangSmith
Vendor lock-in on trace data. Traces (which contain your prompts and completions) live on LangChain's infrastructure by default. Unless you pay for enterprise self-hosting, your data is on their servers. This is a blocker for teams with strict data residency requirements.
Tight coupling to LangChain. While non-LangChain instrumentation is possible, the platform assumes LangChain is your primary framework. If you move to a different framework or mix LangChain with custom code, the ergonomics degrade.
Pricing at scale. LangSmith pricing (as of mid-2026) is per-seat plus per-trace. A trace is a top-level call to your chain or agent; each sub-call within that trace is a span (not a separate trace). For a team of 10 engineers plus production agents, costs climb quickly. Pricing varies by trace retention length, adding complexity.
No cost enforcement. LangSmith reports cost per trace after the call completes. It does not prevent an overspend before the call executes. This is a shared limitation with Phoenix but worth highlighting because cost control is a primary concern for production LLM apps.
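The per-seat plus per-trace structure described under "Pricing at scale" can be turned into a rough estimator. Every number below is a hypothetical placeholder, not LangSmith's published pricing, and the included-trace allowance is an assumed plan feature. Substitute current figures from the vendor's pricing page before using this for planning.

```python
def monthly_cost(seats, traces, seat_price, included_traces, overage_per_1k):
    """Per-seat plus per-trace estimate in USD; all price inputs are placeholders."""
    billable = max(0, traces - included_traces)  # traces beyond the allowance
    return seats * seat_price + (billable / 1000) * overage_per_1k


# Hypothetical inputs: 10 seats, 500,000 top-level traces in the month.
estimate = monthly_cost(
    seats=10,
    traces=500_000,
    seat_price=40.0,
    included_traces=10_000,
    overage_per_1k=0.50,
)
# 10 * 40.0 + (490_000 / 1000) * 0.50 = 645.0
```

Because only top-level traces are counted, an agent that makes twenty sub-calls per run costs the same per run here as one that makes two. Volume of runs, not depth, drives the trace component.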
Detailed comparison matrix
| Dimension | Arize Phoenix | LangSmith | Winner |
|---|---|---|---|
| Hosting model | Self-hosted or cloud | Cloud only (self-hosted at enterprise tier) | Phoenix (default free) |
| License | Elastic License 2.0 (open source) | Proprietary | Phoenix |
| Framework coupling | OpenTelemetry (vendor-neutral) | LangChain-native | Phoenix (less lock-in) |
| Setup effort | Moderate (self-hosting) | Low (environment variables) | LangSmith |
| Data residency | Your infrastructure | LangChain's servers (default) | Phoenix |
| Eval workflows | Built-in (LLM-as-judge) | Built-in (strong prompt management) | Tie |
| Team collaboration | Limited (RBAC in AX tier) | Good (teams, annotations) | LangSmith |
| Instrumentation breadth | 8+ frameworks via OpenInference | LangChain + manual SDK | Tie |
| Cost per month (small team) | Free (self) or $0–50 (cloud) | $39–200+ (per-seat + per-trace) | Phoenix |
| Cost per month (large team) | $100–500+ (infrastructure) | $500–2000+ (seat + trace scaling) | Depends on infra |
| Cost enforcement | No | No | Neither |
Real-world scenarios: when to pick each
Choose Phoenix if:
- Compliance requires data residency. You cannot send trace data to a third-party cloud. Phoenix self-hosting keeps everything on your infrastructure.
- You use multiple frameworks. Your stack mixes LangChain, DSPy, custom agents, and direct API calls. OpenTelemetry will capture all of it.
- Long-term cost is a priority. Self-hosting Phoenix has no per-seat or per-trace charges; costs scale with your infrastructure (which you can control).
- You have devops resources. You have SREs or MLOps engineers comfortable running and maintaining observability infrastructure.
Choose LangSmith if:
- Your entire stack is LangChain or LangGraph. Setup is trivial and the UX is excellent for that ecosystem.
- You value managed operations. You do not want to run and scale your own tracing infrastructure.
- Prompt management workflows matter. LangSmith's prompt hub and version control are industry-leading.
- You need team collaboration features out of the box. Annotations, RBAC, and sharing are built-in.
Use both if:
- You run LangChain agents (instrument with LangSmith) but also custom code or other frameworks (instrument with OpenTelemetry → Phoenix). Route both to a central Phoenix instance for a unified view.
The enforcement gap both platforms share
Here is the critical limitation Phoenix and LangSmith share: they are observability tools, which means they record what already happened. Both show you cost per call, cost per project, and cost trends over time. That visibility is genuinely useful for understanding and optimizing. But it is fundamentally retrospective: the dashboard turns red after the spend has already left your account.
Neither tool can refuse a call. This is not a defect; enforcement is simply a different design goal from observability. Observability answers "what did this cost?" Enforcement answers "should this call be allowed to run at all?" Most production stacks need both, and most teams only have the first.
The missing piece is a pre-flight check: before the request goes to OpenAI or Anthropic, estimate the token cost, check the user's or project's remaining budget, and block the call if it would exceed the limit. This estimate-then-decide step must happen client-side, in the request path, in the milliseconds before the HTTP call fires. It cannot be downstream of trace data because by the time a trace exists the call has already completed and been billed.
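The estimate-then-decide step can be sketched as follows. This is an illustrative outline under stated assumptions, not any vendor's implementation. The prices are hypothetical, the four-characters-per-token rule is a crude heuristic standing in for a real tokenizer, and `send` is a placeholder for the function that performs the actual provider request.

```python
class BudgetExceeded(Exception):
    """Raised before the provider call when the estimate would overrun the budget."""


# Hypothetical USD prices per 1,000 tokens; substitute your provider's real rates.
INPUT_PRICE_PER_1K = 0.001
OUTPUT_PRICE_PER_1K = 0.002


def estimate_cost(prompt, max_output_tokens):
    """Worst-case estimate: assumes the model uses its full output allowance."""
    input_tokens = max(1, len(prompt) // 4)  # crude ~4 characters per token
    return (input_tokens * INPUT_PRICE_PER_1K
            + max_output_tokens * OUTPUT_PRICE_PER_1K) / 1000


class Budget:
    """Tracks reserved spend against a fixed limit for one user or project."""

    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def reserve(self, amount):
        """Record the spend, or refuse if it would push past the limit."""
        if self.spent_usd + amount > self.limit_usd:
            raise BudgetExceeded(f"estimated ${amount:.6f} exceeds remaining budget")
        self.spent_usd += amount


def guarded_call(budget, prompt, max_output_tokens, send):
    """Check the budget first; only then hand the prompt to the provider."""
    budget.reserve(estimate_cost(prompt, max_output_tokens))
    return send(prompt)


budget = Budget(limit_usd=0.001)
reply = guarded_call(budget, "hi", 100, send=lambda p: "ok")  # within budget
try:
    guarded_call(budget, "x" * 4000, 500, send=lambda p: "ok")
    blocked = False
except BudgetExceeded:
    blocked = True  # refused before any request was sent
```

Reserving the worst-case amount up front errs on the side of blocking. A fuller version would reconcile the reservation against actual usage once the response arrives, and would guard `reserve` with a lock or an atomic store when requests run concurrently.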
For context on why enforcement matters, see our guides on cost spikes and retry logic cost blowups.
FAQ
Is Arize Phoenix free for production use? The self-hosted, open-source version of Phoenix is free with no trace caps or per-trace fees. You pay only for infrastructure (database, compute). Phoenix Cloud offers a free tier; larger deployments use paid tiers. Verify current boundaries at arize.com.
Can I use LangSmith without LangChain? Yes. You can instrument arbitrary code with the @traceable decorator or the LangSmith SDK, so a plain OpenAI or Anthropic application can send traces. The platform's UX and feature set are optimized for LangChain users, but it is not a hard requirement.
Which platform is better for cost control? Neither has built-in cost enforcement. Both report cost after calls complete. To prevent cost overruns before they happen, you need an additional layer like noburn (see next section).
Can I self-host LangSmith? Yes, but only on the enterprise tier, which requires a sales conversation and typically a higher price point. Self-hosting is the default for Phoenix.
Do I need to choose between Phoenix and LangSmith? No. Teams often use LangSmith for LangChain-specific tracing and Phoenix (via OpenTelemetry) for cross-framework visibility, routing both streams to a unified Phoenix backend.
Where cost enforcement fits alongside observability
Observability and enforcement are complementary, not competitive. Use Phoenix or LangSmith (or both) to understand what calls cost and why. Add a pre-flight enforcement layer to prevent overspends from happening in the first place.
noburn is a cost enforcement platform that does what Phoenix and LangSmith structurally cannot: it estimates token cost before the API call fires and blocks the call if the user or project has exceeded their budget. It wraps the provider call directly through SDKs for OpenAI, Anthropic, LiteLLM, LangChain, LangGraph, and Vercel AI SDK, sitting in the same request path that gets traced by your observability tool. For multi-tenant SaaS, noburn enforces per-user budgets so a single end-customer cannot drain your margin. Stripe passthrough billing lets you charge customers for their own LLM usage without writing a billing layer.
The observability tool tells you what happened. The enforcement tool prevents the costly thing from happening in the first place. Together, they form a complete cost management strategy.
Free tier covers 50,000 requests per month. Documentation and SDKs are at noburn.dev/docs.
Key takeaways
- Phoenix excels at self-hosting and framework-agnostic instrumentation. If you have data residency requirements or use multiple frameworks, it is the better choice.
- LangSmith excels at LangChain integration and managed simplicity. If your stack is entirely LangChain, LangSmith's setup and UX are hard to beat.
- Neither platform enforces budgets. Both report cost after the fact. For prevention, add a separate enforcement layer.
- Observability and enforcement are complementary. Use them together for complete cost visibility and control.
- Pricing differs significantly at scale. Phoenix self-hosting costs grow with infrastructure; LangSmith costs grow with team size and trace volume.