LangChain in Production: Controlling Token Costs When Usage Is Unpredictable

LangChain abstracts the model call but not the cost. When agents retry, documents vary in size, and tools fire in unexpected sequences, a call budget set at design time breaks at runtime. Here is how to fix it.

noburn.dev·2026-06-04

LangChain has become the standard framework for building production AI agents. According to a 2025 survey by DeepLearning.AI, over 65% of enterprises building multi-step LLM applications use LangChain or LangGraph. The framework abstracts away the complexity of prompt templates, tool calling, and agent loops, letting you focus on logic instead of low-level API details.

But LangChain abstracts one thing it should not: costs.

You design your agent to stay under $0.10 per call on average. It does — until one user queries a 500-page document, another hits a retry loop, and a third chains five tool calls in unexpected ways. A single call costs $2.40. Your margin evaporates. Your logs show nothing unusual because LangChain measures tokens but does not enforce limits.

The problem is structural. A call that looks identical in your code — chain.invoke(input) — can cost anywhere from $0.01 to $5 depending on input tokens, output length, and retry behavior. You cannot budget for what you cannot predict, and you cannot predict costs in LangChain without instrumentation.
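To make that range concrete, here is a minimal sketch of the per-call cost arithmetic. The prices are illustrative placeholders in USD per million tokens, not a quote of any provider's current rates, and `estimate_call_cost_usd` is a hypothetical helper rather than part of LangChain.

```python
def estimate_call_cost_usd(
    input_tokens: int,
    output_tokens: int,
    attempts: int = 1,
    input_usd_per_mtok: float = 2.50,    # assumed price, USD per 1M input tokens
    output_usd_per_mtok: float = 10.00,  # assumed price, USD per 1M output tokens
) -> float:
    """Cost of one logical call, counting every retry as a full extra request."""
    per_attempt = (
        input_tokens / 1_000_000 * input_usd_per_mtok
        + output_tokens / 1_000_000 * output_usd_per_mtok
    )
    return per_attempt * attempts


# A short question with a short answer.
small = estimate_call_cost_usd(input_tokens=2_000, output_tokens=500)

# A long document, a long answer, and two retries (three attempts in total).
large = estimate_call_cost_usd(input_tokens=120_000, output_tokens=4_000, attempts=3)

print(f"small call: ${small:.4f}")
print(f"large call: ${large:.4f}")
```

Under these assumed prices the second call costs about a hundred times the first, even though both are a single chain.invoke(input) in application code.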

The cost problem at scale

Consider a real case: a SaaS platform using LangChain agents for customer support automation. The average agent call processes a customer query and searches a knowledge base, costing roughly $0.08. Over 50,000 monthly calls, that is $4,000/month — acceptable for a $500K ARR business.

Then usage patterns shift. One power user starts submitting 100-page documents for analysis. The agent's context window doubles. Another user triggers a retry loop when the knowledge base is slow. A third discovers that chaining three sequential agents instead of one gives better results and does it on every call.

Token costs are no longer clustered around the mean. The distribution has a heavy tail. The 95th percentile call costs $1.20. The 99th percentile costs $4.50. If 1% of your 50,000 monthly calls land in that tail, that is 500 tail calls. At roughly $4.50 each, they add $2,250 of unexpected cost on top of the $4,000 you expected, a 56% overrun that can turn a healthy margin negative.
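The arithmetic in this scenario is simple enough to keep as a small function and rerun with your own numbers. This is a sketch; `tail_overrun_usd` is a hypothetical helper, and it assumes every tail call costs the same amount, which understates the real tail.

```python
def tail_overrun_usd(calls_per_month: int, tail_fraction: float, tail_cost_usd: float) -> float:
    """Extra monthly spend from the share of calls that land in the expensive tail."""
    return calls_per_month * tail_fraction * tail_cost_usd


expected_usd = 50_000 * 0.08                        # the budgeted monthly cost
overrun_usd = tail_overrun_usd(50_000, 0.01, 4.50)  # 1% of calls at $4.50 each
overrun_ratio = overrun_usd / expected_usd

print(f"expected ${expected_usd:,.0f}, overrun ${overrun_usd:,.0f} ({overrun_ratio:.0%})")
```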

LangChain's built-in observability does not prevent this. It measures it. The distinction matters.

Why observability alone fails

LangChain's callback system provides on_llm_start and on_llm_end hooks on BaseCallbackHandler, and the on_llm_end hook receives the token counts reported by the provider. Integrate LangSmith or Langfuse and you get a dashboard showing token usage per call, per user, per agent. This is invaluable for understanding what happened.

But it is post-hoc analysis. The call already fired. The tokens were consumed. The bill already reflects them. If a user submits an edge-case query that costs $5 instead of the expected $0.08, a dashboard tells you about it after the transaction posts. By then, the SaaS platform has already absorbed the cost.

The observability tools in this space — LangSmith, Langfuse, Arize Phoenix — are excellent for debugging, eval workflows, and understanding call patterns over time. But none of them stop a call before it fires. They are detection mechanisms, not prevention mechanisms.

Measurement: where to look

If you deploy observability first, focus on the tail, not the mean. Log token counts per user over at least 5-7 days to capture variability. Then analyze the distribution:

Mean — your budget assumption (what you designed for)
95th percentile — the cost of one call in twenty
99th percentile — the cost of one call in a hundred

If your mean is $0.10 but your 99th percentile is $2.40, and your product has 10,000 free-tier calls per month, then 1% of those calls (100 calls) cost at least $2.40 each. You lose at least $240/month on tail events alone. That $240 is not a reporting error. It is a structural leak.
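Once per-call costs are logged, the three figures can be computed in a few lines. The sketch below uses the nearest-rank percentile definition and a made-up sample of 100 call costs; `percentile` and `summarize_costs` are hypothetical helper names, and the sample data is invented for illustration.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_costs(costs_usd: list[float]) -> dict[str, float]:
    """Mean, 95th and 99th percentile of per-call costs."""
    return {
        "mean": sum(costs_usd) / len(costs_usd),
        "p95": percentile(costs_usd, 95),
        "p99": percentile(costs_usd, 99),
    }


# Invented sample: 94 ordinary calls, 4 expensive calls, 2 very expensive calls.
costs = [0.08] * 94 + [1.20] * 4 + [4.50] * 2
summary = summarize_costs(costs)
print(summary)
```

In this sample the mean is pulled well above the typical $0.08 call by just six outliers, which is why a budget set from the mean alone misreads the workload.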

For LangChain specifically, integrate with LangSmith's API or use the native callbacks to track both prompt and completion tokens:

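from langchain_core.callbacks import BaseCallbackHandler

class TokenLogger(BaseCallbackHandler):
    def __init__(self, user_id: str):
        # Callbacks are not handed a user_id, so bind it when creating the handler
        self.user_id = user_id

    def on_llm_end(self, response, **kwargs):
        # OpenAI-style models report counts under "token_usage"; llm_output may be None
        usage = (response.llm_output or {}).get("token_usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)
        # Log to database or analytics (log_tokens is your own persistence function)
        log_tokens(self.user_id, prompt_tokens, completion_tokens)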

Enforcement: preventing overruns

Measurement tells you where the leak is. Enforcement stops it from happening.

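LangChain's callback system lets you intercept calls before they fire. A BaseCallbackHandler subclass can estimate the cost of an incoming request and block it if it would exceed a budget. The handler must set raise_error = True: by default LangChain logs exceptions raised inside callbacks and lets the call proceed.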

from langchain_core.callbacks import BaseCallbackHandler
import tiktoken

class BudgetEnforcer(BaseCallbackHandler):
    # Without this, LangChain logs callback exceptions and lets the call proceed
    raise_error = True

    def __init__(self, user_id: str, monthly_budget_cents: int, pricing: dict):
        self.user_id = user_id
        self.budget_cents = monthly_budget_cents
        # Prices are USD per 1M tokens. Example values only; check your provider's price list.
        self.pricing = pricing  # {"gpt-4o": {"input": 2.50, "output": 10.00}}
        self.spent_cents = 0
        self.enc = tiktoken.encoding_for_model("gpt-4o")

    def on_llm_start(self, serialized: dict, prompts: list, **kwargs) -> None:
        # Estimate input cost
        prompt_tokens = sum(len(self.enc.encode(p)) for p in prompts)

        # Use 99th percentile output length for your workload (e.g., 2000 tokens)
        estimated_output = 2000

        # The model name usually arrives in invocation_params; fall back to gpt-4o
        params = kwargs.get("invocation_params") or {}
        model_name = params.get("model") or params.get("model_name") or "gpt-4o"
        rates = self.pricing.get(model_name, {})
        input_pricing = rates.get("input", 2.50)
        output_pricing = rates.get("output", 10.00)

        # tokens / 1M * (USD per 1M tokens) * 100 = cents
        input_cost_cents = (prompt_tokens / 1_000_000) * input_pricing * 100
        output_cost_cents = (estimated_output / 1_000_000) * output_pricing * 100
        total_estimated_cents = input_cost_cents + output_cost_cents

        # Block if over budget
        if self.spent_cents + total_estimated_cents > self.budget_cents:
            raise ValueError(
                f"Budget exceeded. User {self.user_id} has ${self.spent_cents/100:.2f} "
                f"spent; this call would cost ${total_estimated_cents/100:.2f}."
            )

        self.spent_cents += total_estimated_cents

The key detail is the output estimate. You do not know the actual output length until the model generates it. Use your 99th percentile output length, not the mean, so that the estimate covers all but the rarest calls. For a hard ceiling, also cap the model's maximum output tokens at the same value so that no response can exceed the estimate.
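Booking the worst-case estimate and never correcting it would charge every call as if it hit the tail, so the budget would run out far too early. One way to handle this is to reserve the estimate before the call and settle with the actual cost afterwards. The sketch below assumes the actual cost becomes available once the call completes (for example from the token counts seen in on_llm_end); `SpendLedger` and `BudgetExceeded` are hypothetical names, and the ledger is in-memory only.

```python
class BudgetExceeded(Exception):
    """Raised when a reservation would push spend past the budget."""


class SpendLedger:
    def __init__(self, budget_cents: int):
        self.budget_cents = budget_cents
        self.settled_cents = 0   # actual cost of completed calls
        self.reserved_cents = 0  # worst-case holds for calls still in flight

    @property
    def remaining_cents(self) -> int:
        return self.budget_cents - self.settled_cents - self.reserved_cents

    def reserve(self, estimate_cents: int) -> int:
        """Hold the worst-case estimate before the call fires."""
        if estimate_cents > self.remaining_cents:
            raise BudgetExceeded(
                f"need {estimate_cents} cents, only {self.remaining_cents} remaining"
            )
        self.reserved_cents += estimate_cents
        return estimate_cents

    def settle(self, reserved_cents: int, actual_cents: int) -> None:
        """Release the hold and book what the call really cost."""
        self.reserved_cents -= reserved_cents
        self.settled_cents += actual_cents


ledger = SpendLedger(budget_cents=100)
hold = ledger.reserve(40)            # worst-case estimate before the call
ledger.settle(hold, actual_cents=5)  # the call turned out to be cheap
print(ledger.remaining_cents)
```

The reservation keeps the pre-call guarantee, while settlement returns the unused part of the estimate to the budget.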

Multi-level budgets: users, projects, and features

Production deployments need multiple budget scopes:

Per-user: $5/month for free users, $50/month for pro users
Per-project: $200/month for one customer's entire project
Per-feature: $0.50 per customer query in the support chatbot, unlimited for internal admin tools
Per-session: $2 per customer conversation, reset daily

LangChain supports chaining multiple handlers. Each one can enforce a different budget scope:

from langchain.agents import create_react_agent, AgentExecutor

enforcer_user = BudgetEnforcer("user_123", 50000, pricing)  # $500/month
enforcer_project = BudgetEnforcer("project_456", 200000, pricing)  # $2000/month
enforcer_feature = BudgetEnforcer("feature_support_chat", 10000, pricing)  # $100/month

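# agent and tools are built elsewhere
executor = AgentExecutor.from_agent_and_tools(
    agent=agent,
    tools=tools,
    verbose=True
)

# Pass callbacks at request time so they propagate to the nested LLM calls;
# callbacks given to a constructor apply only to that object.
result = executor.invoke(
    {"input": user_query},
    config={"callbacks": [enforcer_user, enforcer_project, enforcer_feature]},
)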

Handlers execute in order. If any budget is exceeded, that handler raises and the agent halts before the API call, provided the handler sets raise_error = True. This is critical: the cost is prevented, not merely recorded.
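One caveat with independent handlers: each one books the estimate as soon as its own check passes, so when a later scope rejects the call, the earlier scopes have already been charged for a request that never ran. A possible alternative, sketched here in plain Python with hypothetical names (`ScopeBudget`, `charge_all`), is to check every scope first and charge only if all of them pass.

```python
from dataclasses import dataclass


@dataclass
class ScopeBudget:
    name: str
    limit_cents: int
    spent_cents: int = 0

    def would_exceed(self, cost_cents: int) -> bool:
        return self.spent_cents + cost_cents > self.limit_cents


def charge_all(scopes: list[ScopeBudget], cost_cents: int) -> None:
    """Charge every scope, or none of them if any scope would go over."""
    blocked = [s.name for s in scopes if s.would_exceed(cost_cents)]
    if blocked:
        raise RuntimeError(f"Budget exceeded for: {', '.join(blocked)}")
    for scope in scopes:
        scope.spent_cents += cost_cents


user = ScopeBudget("user_123", limit_cents=5_000)
session = ScopeBudget("session_789", limit_cents=200)

charge_all([user, session], cost_cents=150)      # fits both scopes
try:
    charge_all([user, session], cost_cents=100)  # session would reach 250 > 200
except RuntimeError as err:
    print(err)
print(user.spent_cents)  # the rejected call left the user scope untouched
```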

The limitation: you still need observability

Cost enforcement is not a replacement for observability. You need both:

Observability (LangSmith, Langfuse, Phoenix) tells you what happened — which agent step was slow, which tool call was expensive, which user triggered the tail. Use it for debugging and optimization.
Enforcement (budget handlers, LLM-native rate limits) tells the system not to do it. Use it to prevent overruns.

The gap most LLM observability tools fill is real but limited. They see the tokens after they are spent. If your goal is to prevent a user from spending $50 when they have $10 remaining, observability tools cannot help — they tell you about the overage post-transaction.

That is why enforcement must happen at the API call layer, before the request fires.

LangSmith vs. cost enforcement

LangSmith is the official observability platform from the LangChain team. It integrates tightly with LangChain and LangGraph, traces the full agent tree, and provides eval frameworks. It is excellent for understanding agent behavior.

LangSmith does not enforce budgets. It shows you that a user spent $50 on a single call. It does not stop the call from happening.

For cost control, you need a separate layer. This is the structural limitation: LangChain (and most LLM frameworks) separate the observability layer from the enforcement layer, and cost control requires enforcement to work.

Real-world integration patterns

Pattern 1: Simple per-user limit

user_budget = 100  # cents = $1.00
user_enforcer = BudgetEnforcer(user_id, user_budget, gpt4o_pricing)
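# create_react_agent takes no callbacks argument; pass them per request instead
# (executor is the AgentExecutor built in the previous section)
result = executor.invoke({"input": user_query}, config={"callbacks": [user_enforcer]})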

Good for: Simple SaaS with a single agent per user.

Pattern 2: Tiered budgets by user tier

tier_limits = {
    "free": 10000,      # $100/month
    "pro": 100000,      # $1000/month
    "enterprise": None, # unlimited
}

user_tier = get_user_tier(user_id)
budget_cents = tier_limits[user_tier]

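# Compare against None so that a budget of 0 blocks instead of meaning "unlimited"
enforcer = BudgetEnforcer(user_id, budget_cents, gpt4o_pricing) if budget_cents is not None else None
callbacks = [enforcer] if enforcer else []
result = executor.invoke({"input": user_query}, config={"callbacks": callbacks})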

Good for: Freemium SaaS where LLM cost scales with user tier.

Pattern 3: Per-feature budgets

feature_budgets = {
    "customer_support": 50000,    # $500/month
    "analytics_dashboard": 20000,  # $200/month
    "internal_research": None,     # unlimited
}

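feature = request.get("feature")
# Index directly: an unknown feature raises KeyError instead of silently running unlimited
budget = feature_budgets[feature]

enforcer = BudgetEnforcer(f"feature_{feature}", budget, gpt4o_pricing) if budget is not None else None
callbacks = [enforcer] if enforcer else []
result = executor.invoke({"input": user_query}, config={"callbacks": callbacks})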

Good for: Platforms where different features have different cost constraints.
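All three patterns build a fresh BudgetEnforcer whose spend starts at zero and lives in process memory, so a monthly limit only holds if the spend is stored somewhere shared and keyed by billing period. The sketch below shows the keying idea with an in-memory dictionary. `MonthlySpendStore` is a hypothetical name, and in production the dictionary would be replaced by a shared store such as a database, which is an assumption rather than something this article prescribes.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class MonthlySpendStore:
    """Tracks spend per (scope, calendar month); a new month starts from zero."""

    def __init__(self) -> None:
        self._spent: dict[tuple[str, str], int] = defaultdict(int)

    @staticmethod
    def _period(day: date) -> str:
        return day.strftime("%Y-%m")

    def spent(self, scope_id: str, day: date) -> int:
        return self._spent[(scope_id, self._period(day))]

    def try_charge(
        self, scope_id: str, limit_cents: Optional[int], cost_cents: int, day: date
    ) -> bool:
        """Book the cost and return True, or return False if it would exceed the limit.

        A limit of None means unlimited, matching the tier tables above.
        """
        key = (scope_id, self._period(day))
        if limit_cents is not None and self._spent[key] + cost_cents > limit_cents:
            return False
        self._spent[key] += cost_cents
        return True


store = MonthlySpendStore()
june, july = date(2026, 6, 4), date(2026, 7, 1)

print(store.try_charge("user_123", 500, 400, june))  # within the June limit
print(store.try_charge("user_123", 500, 200, june))  # would exceed the June limit
print(store.try_charge("user_123", 500, 200, july))  # July is a new period
```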

Where noburn fits in this architecture

noburn takes the enforcement pattern and simplifies it. Instead of subclassing BaseCallbackHandler yourself, wrap your LangChain client at initialization:

from noburn import with_noburn_budget
from langchain_openai import ChatOpenAI

model = with_noburn_budget(
    ChatOpenAI(model="gpt-4o"),
    user_id="user_123",
    monthly_budget_cents=50000,  # $500/month
    project_id="project_456",
    project_budget_cents=500000   # $5000/month for the entire project
)

agent = create_react_agent(model, tools)
result = agent.invoke({"input": user_query})

noburn handles:

Accurate token estimation — uses the same tokenizers as the model providers, not approximations
Model-aware pricing — knows current rates for GPT-4o, Claude, Gemini, Llama, and others (updated automatically)
Retry amplification — detects and accounts for retries that would double or triple cost
Per-user and per-project metering — enforce multiple budgets simultaneously
Billing integration — Stripe passthrough lets you bill users for their LLM usage directly, without a separate billing layer

The free tier covers 50,000 requests per month. For higher volume, per-request pricing starts at $0.001. Documentation and SDKs are at noburn.dev/docs.

Key takeaways

LangChain hides cost behind the same abstraction that hides the API call. Identical-looking calls can differ in cost by orders of magnitude, so you cannot assume uniform costs.
Observability measures; enforcement prevents. You need both. Observability alone tells you after you overspend.
Budget at percentiles, not averages. The 99th percentile call is what breaks your unit economics, not the mean.
Multi-level budgets are essential in production. Per-user, per-project, and per-feature limits all matter.
Enforcement must happen before the API call fires. Post-call detection is too late.