
LangGraph Budget Enforcement: Capping Costs Without Rewriting Your Graph

LangGraph gives you stateful agents and conditional routing. It does not give you a cost ceiling. Here is how to add per-user budget enforcement to an existing LangGraph workflow without restructuring the graph.


Editorial·2026-06-01

You have a working LangGraph agent. It routes between nodes, maintains conversation state, calls tools, and loops until it has an answer. What it does not have is a cost ceiling. If a user submits a prompt that triggers five tool calls and three LLM rounds, your bill reflects that, regardless of what you had in mind.

The obvious fix, setting max_tokens on the model, does not solve the problem. max_tokens caps the output of a single completion. It does not cap the input tokens, which grow as the conversation history is re-sent on every loop, and it does not cap total spend across a multi-turn agent run. A ten-node graph with a 512-token limit per call can still burn $2 per invocation if it loops enough times. And for multi-tenant SaaS, you have a second dimension: each customer needs their own budget, not a global cap shared across all users.
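A rough cost model shows why a per-call limit does not bound the run. The sketch below is a simplification, not a provider formula. It assumes that every loop re-sends the whole history as input, that every reply uses the full max_tokens allowance, and that each tool result has a fixed size. The rates are the illustrative per-1K Claude Sonnet 4 figures from the pricing table later in this article, and run_cost is a hypothetical name.

```python
def run_cost(
    rounds: int,
    base_input_tokens: int,
    max_output_tokens: int,
    tool_result_tokens: int,
    input_rate_per_1k: float,
    output_rate_per_1k: float,
) -> float:
    """Estimated USD cost of an agent loop whose history grows every round.

    Simplifying assumptions: each round re-sends the full history as input,
    each reply uses all of max_output_tokens, tool results have a fixed size.
    """
    total = 0.0
    context_tokens = base_input_tokens
    for _ in range(rounds):
        total += (context_tokens / 1000) * input_rate_per_1k
        total += (max_output_tokens / 1000) * output_rate_per_1k
        # The reply and the tool result join the history for the next round
        context_tokens += max_output_tokens + tool_result_tokens
    return total


if __name__ == "__main__":
    one = run_cost(1, 2_000, 512, 1_500, 0.003, 0.015)
    ten = run_cost(10, 2_000, 512, 1_500, 0.003, 0.015)
    print(f"1 round:   ${one:.5f}")   # $0.01368
    print(f"10 rounds: ${ten:.5f}")   # $0.40842
```

Under these assumptions, ten rounds cost roughly 30 times as much as one, even though max_tokens never changed. The input side grows with every loop.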

LangGraph does not provide a budget hook. There is no on_before_node middleware, no built-in spending ledger, no way to abort mid-graph when a threshold is crossed. What it does provide is a clean model for state and node registration that you can exploit to add enforcement without touching your existing node logic.

Prerequisites

langgraph >= 0.2 and langchain-core >= 0.3
An existing compiled StateGraph (the technique works on any graph topology)
Python 3.11+
A persistent store for per-user budgets if you need cross-run enforcement (Redis or Postgres; examples below use a plain dict with a note on production options)

You do not need to restructure your graph. The approach wraps nodes at registration time, which means your node functions stay untouched.

How LangGraph Passes State

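Every node in a LangGraph graph is a plain Python callable that receives the current state and returns a partial update. The state schema here is a TypedDict. After a node runs, the graph merges its returned dict into the running state before the next node executes. A key annotated with a reducer (such as operator.add on messages below) is combined with the existing value, and any other key is simply overwritten.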

Here is a minimal ReAct-style state schema extended with budget tracking fields:

from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
import operator

class AgentState(TypedDict):
    # Standard conversation history
    messages: Annotated[Sequence[BaseMessage], operator.add]

    # Budget tracking fields — added alongside your existing fields
    tokens_used: int           # cumulative tokens across all LLM calls this run
    cost_usd: float            # cumulative spend in USD this run
    budget_usd: float          # per-user limit loaded at graph entry
    budget_exhausted: bool     # True once cost_usd >= budget_usd

budget_exhausted is a boolean flag rather than a threshold comparison in every node. You set it once when the budget is crossed, then route on it anywhere in the graph. Any node that needs to know the budget status reads a single field — no repeated arithmetic scattered through your logic.
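In LangGraph, a state key annotated with a reducer is combined with its existing value, while a key without one is overwritten by whatever the node returns. That matters for the budget fields. A node returns only its new messages, but it must return the new running total for cost_usd (previous value plus this call's cost), not just the delta. The stdlib-only sketch below is a simplified model of that merge rule, not LangGraph's actual implementation. apply_update and REDUCERS are hypothetical names, and messages are plain strings for brevity.

```python
import operator
from typing import Any, Callable

# Hypothetical stand-in for the Annotated[..., operator.add] metadata on AgentState:
# keys listed here are combined with a reducer, every other key is overwritten.
REDUCERS: dict[str, Callable[[Any, Any], Any]] = {"messages": operator.add}


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Return a new state with one node's partial update merged in."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        if reducer is not None and key in merged:
            merged[key] = reducer(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["hi"], "cost_usd": 0.02, "budget_exhausted": False}

# Correct: the node returns the new running total
state = apply_update(state, {"messages": ["answer"], "cost_usd": 0.02 + 0.01})
print(state["messages"])             # ['hi', 'answer']
print(round(state["cost_usd"], 4))   # 0.03

# Wrong: returning only this call's cost silently replaces the total
bad = apply_update(state, {"cost_usd": 0.01})
print(bad["cost_usd"])               # 0.01
```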

Per-User Budget Store

For a single-process server, a plain dict is enough to get started:

# budget_store.py
from threading import Lock

_store: dict[str, float] = {}
_lock = Lock()

DEFAULT_BUDGET_USD = 1.00  # starting budget per user; nothing here resets it, so call reset() on your own schedule (e.g. daily)

def get_budget(user_id: str) -> float:
    with _lock:
        return _store.get(user_id, DEFAULT_BUDGET_USD)

def deduct(user_id: str, amount_usd: float) -> float:
    """Deduct spend and return the remaining budget. Never goes below zero."""
    with _lock:
        current = _store.get(user_id, DEFAULT_BUDGET_USD)
        remaining = max(0.0, current - amount_usd)
        _store[user_id] = remaining
        return remaining

def reset(user_id: str) -> None:
    with _lock:
        _store[user_id] = DEFAULT_BUDGET_USD

In production, replace this module with a Redis key per user or a Postgres row. The interface (get_budget, deduct, reset) stays the same. The node wrapper below calls only get_budget and deduct; reset belongs to whatever job refreshes budgets. That makes the swap a one-file change.
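One way to make that swap explicit is a typing.Protocol that every store satisfies. The sketch below is an assumption, not part of the original modules. BudgetStore and InMemoryBudgetStore are hypothetical names, and the class repackages the dict-and-lock logic above as an object you can inject (for example, a fresh instance per unit test).

```python
from threading import Lock
from typing import Protocol


class BudgetStore(Protocol):
    """Structural interface: any object with these three methods qualifies."""

    def get_budget(self, user_id: str) -> float: ...
    def deduct(self, user_id: str, amount_usd: float) -> float: ...
    def reset(self, user_id: str) -> None: ...


class InMemoryBudgetStore:
    """Same behavior as budget_store.py, but as an injectable object."""

    def __init__(self, default_budget_usd: float = 1.00) -> None:
        self._default = default_budget_usd
        self._store: dict[str, float] = {}
        self._lock = Lock()

    def get_budget(self, user_id: str) -> float:
        with self._lock:
            return self._store.get(user_id, self._default)

    def deduct(self, user_id: str, amount_usd: float) -> float:
        with self._lock:
            current = self._store.get(user_id, self._default)
            remaining = max(0.0, current - amount_usd)  # never below zero
            self._store[user_id] = remaining
            return remaining

    def reset(self, user_id: str) -> None:
        with self._lock:
            self._store[user_id] = self._default


def has_budget(store: BudgetStore, user_id: str) -> bool:
    # Type checkers accept any class that matches the Protocol, e.g. a Redis-backed one
    return store.get_budget(user_id) > 0
```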

Redis sketch for production:

import redis

DEFAULT_BUDGET_USD = 1.00  # keep in sync with budget_store.py

_r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# The whole script runs atomically on the Redis server: no other client command
# can interleave between its GET and SET, so concurrent deducts cannot lose updates.
# It touches a single key, so it also works unchanged on Redis Cluster.
_DEDUCT_SCRIPT = """
local key = KEYS[1]
local amount = tonumber(ARGV[1])
-- GET returns false for a missing key; fall back to the default budget (ARGV[2])
local current = tonumber(redis.call('GET', key) or ARGV[2])
local updated = math.max(0, current - amount)
-- KEEPTTL (Redis 6.0+) preserves any expiry used for periodic resets
redis.call('SET', key, tostring(updated), 'KEEPTTL')
-- Return a string: Redis truncates Lua numbers to integers in replies
return tostring(updated)
"""

def get_budget(user_id: str) -> float:
    val = _r.get(f"budget:{user_id}")
    return float(val) if val is not None else DEFAULT_BUDGET_USD

def deduct(user_id: str, amount_usd: float) -> float:
    result = _r.eval(_DEDUCT_SCRIPT, 1, f"budget:{user_id}", amount_usd, DEFAULT_BUDGET_USD)
    return float(result)

def reset(user_id: str) -> None:
    _r.set(f"budget:{user_id}", DEFAULT_BUDGET_USD)

LangGraph Cost Enforcement: The Node-Wrapping Pattern

The key insight is that LangGraph lets you register any callable as a node. That callable can be a closure. A closure can carry budget-checking logic that fires before and after the inner function without the inner function knowing it exists.

First, a helper that converts token counts to USD. Token pricing varies by model; keep this table in one place:

# pricing.py

# Prices in USD per 1,000 tokens (input, output), as published at the time of writing.
# Verify against each provider's pricing page; keys must match the model string
# you pass to budget_enforced.
MODEL_PRICING: dict[str, tuple[float, float]] = {
    "claude-sonnet-4":     (0.003,   0.015),
    "claude-opus-4":       (0.015,   0.075),
    "claude-haiku-4":      (0.0008,  0.004),
    "gpt-4.1":             (0.002,   0.008),
    "gpt-4.1-mini":        (0.0004,  0.0016),
    "gemini-2.5-flash":    (0.0003,  0.0025),  # text input rate; some input types are priced differently
}

# Unknown models fall back to a fixed default rate. It is higher than most entries
# above, so a missing key tends to over-count spend rather than under-count it.
def tokens_to_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    input_rate, output_rate = MODEL_PRICING.get(model, (0.01, 0.03))
    return (input_tokens / 1000 * input_rate) + (output_tokens / 1000 * output_rate)
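Provider pricing pages usually quote prices per million tokens, while MODEL_PRICING is per 1,000 tokens. Each table entry is therefore the published figure divided by 1,000; the gpt-4.1-mini row, for instance, corresponds to $0.40 input and $1.60 output per million. The sketch below uses hypothetical helper names to show that conversion and a worked cost for one call. It repeats the tokens_to_usd formula with explicit rates so it runs on its own.

```python
def per_1k(price_per_million_usd: float) -> float:
    """Convert a published per-1M-token price into the per-1K rate used in MODEL_PRICING."""
    return price_per_million_usd / 1000


def call_cost_usd(input_tokens: int, output_tokens: int,
                  input_rate_per_1k: float, output_rate_per_1k: float) -> float:
    # Same formula as pricing.tokens_to_usd, with the rates passed in explicitly
    return (input_tokens / 1000 * input_rate_per_1k) + (output_tokens / 1000 * output_rate_per_1k)


# gpt-4.1-mini row of the table: $0.40 / $1.60 per million tokens
input_rate = per_1k(0.40)    # 0.0004
output_rate = per_1k(1.60)   # 0.0016

# 10,000 input tokens -> $0.0040, 2,000 output tokens -> $0.0032
cost = call_cost_usd(10_000, 2_000, input_rate, output_rate)
print(f"${cost:.4f}")  # $0.0072
```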

Now the wrapper factory. It takes any node function, a user_id, and the model name used for pricing, and returns a budget-aware replacement:

# budget_wrapper.py
from typing import Callable
from langchain_core.messages import AIMessage
import budget_store
import pricing

def budget_enforced(node_fn: Callable, user_id: str, model: str) -> Callable:
    """
    Wraps a LangGraph node function with per-user budget enforcement.

    - Checks the budget BEFORE calling the node.
    - If exhausted, returns budget_exhausted=True without running the node.
    - If the node runs, reads token usage from the AIMessages it returned,
      deducts the spend, and updates the run's tracking fields.
    """
    def wrapper(state: dict) -> dict:
        # Pre-call: skip the node entirely if the user has no budget left
        remaining = budget_store.get_budget(user_id)
        if remaining <= 0 or state.get("budget_exhausted", False):
            return {"budget_exhausted": True}

        # Run the actual node; copy its update so the node's own dict is not mutated
        result = dict(node_fn(state) or {})

        # Post-call: sum usage from model responses only. Tool nodes return
        # ToolMessage objects, which carry no usage_metadata, so paid API calls
        # made inside tools are not counted here (an intentional scope boundary).
        call_cost = 0.0
        call_tokens = 0
        for msg in result.get("messages", []):
            if isinstance(msg, AIMessage) and msg.usage_metadata:
                input_tokens = msg.usage_metadata.get("input_tokens", 0)
                output_tokens = msg.usage_metadata.get("output_tokens", 0)
                call_tokens += input_tokens + output_tokens
                call_cost += pricing.tokens_to_usd(model, input_tokens, output_tokens)

        if call_cost > 0:
            remaining_after = budget_store.deduct(user_id, call_cost)
            # These keys have no reducer, so the update overwrites them:
            # return the running totals, not just this call's share
            result["tokens_used"] = state.get("tokens_used", 0) + call_tokens
            result["cost_usd"] = state.get("cost_usd", 0.0) + call_cost
            if remaining_after <= 0:
                result["budget_exhausted"] = True

        return result

    return wrapper

usage_metadata is populated automatically by LangChain's model integrations when you call .invoke(). If you are on an older integration that does not populate it, you can instrument the LLM call directly using callbacks and a custom BaseCallbackHandler — the budget deduction logic is the same, just triggered in on_llm_end instead.
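When usage_metadata is missing and a callback is not an option, a rough estimate is safer than treating the call as free. The sketch below is an assumption, not a LangChain API. usage_or_estimate is a hypothetical helper that prefers reported usage and otherwise applies the common rule of thumb of about four characters per English token. It duck-types the message so it runs without LangChain installed; inside the wrapper you would call it on each AIMessage.

```python
import math
from types import SimpleNamespace
from typing import Any

CHARS_PER_TOKEN = 4  # rough heuristic for English text; not exact for any tokenizer


def estimate_tokens(text: str) -> int:
    # Round up so short, non-empty strings never count as zero tokens
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def usage_or_estimate(msg: Any, prompt_text: str) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) for one model response.

    Uses the reported usage_metadata when present; otherwise estimates from
    the prompt text and the response content (assumed to be a plain string).
    """
    usage = getattr(msg, "usage_metadata", None)
    if usage:
        return usage.get("input_tokens", 0), usage.get("output_tokens", 0)
    return estimate_tokens(prompt_text), estimate_tokens(str(msg.content))


reported = SimpleNamespace(content="ok", usage_metadata={"input_tokens": 120, "output_tokens": 8})
missing = SimpleNamespace(content="x" * 40, usage_metadata=None)
print(usage_or_estimate(reported, "ignored"))   # (120, 8)
print(usage_or_estimate(missing, "y" * 100))    # (25, 10)
```

The estimate ignores system prompts, tool schemas, and per-message overhead, so it undercounts; add a margin before using it for enforcement.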

One subtlety worth noting: the cost loop inspects only AIMessage objects. Tool nodes in LangGraph return ToolMessage objects, which carry no usage_metadata. This means the wrapper tracks model inference costs only. If your tools make their own paid API calls, those costs are invisible to this loop and must be tracked at the tool level. This is a deliberate scope boundary — see the Limitations section below.

Wiring It Into Your LangGraph Graph

Assume you already have node functions call_model and call_tools. Here is how you register them with per-user budget enforcement and add conditional routing:

from langgraph.graph import StateGraph, END
from langgraph.graph.state import CompiledStateGraph
from budget_wrapper import budget_enforced
import budget_store  # the in-memory or Redis module from above
# AgentState, call_model and call_tools come from your existing agent code

def build_graph(user_id: str) -> CompiledStateGraph:
    model = "gpt-4.1-mini"  # must match a key in pricing.MODEL_PRICING

    graph = StateGraph(AgentState)

    # Register wrapped nodes — original functions are untouched
    graph.add_node("agent", budget_enforced(call_model, user_id, model))
    graph.add_node("tools", budget_enforced(call_tools, user_id, model))
    graph.add_node("budget_exceeded", budget_exceeded_node)

    graph.set_entry_point("agent")

    # Conditional routing: check budget_exhausted before each tool call
    graph.add_conditional_edges(
        "agent",
        route_after_agent,
        {
            "tools": "tools",
            "end": END,
            "budget_exceeded": "budget_exceeded",
        },
    )
    graph.add_edge("tools", "agent")
    graph.add_edge("budget_exceeded", END)

    return graph.compile()


def route_after_agent(state: AgentState) -> str:
    if state.get("budget_exhausted"):
        return "budget_exceeded"
    # Standard ReAct routing: if the model called a tool, go to tools node
    last_message = state["messages"][-1]
    if hasattr(last_message, "tool_calls") and last_message.tool_calls:
        return "tools"
    return "end"


def budget_exceeded_node(state: AgentState) -> dict:
    """Early-exit node that appends a user-facing message and halts the graph."""
    from langchain_core.messages import AIMessage
    return {
        "messages": [
            AIMessage(
                content=(
                    f"I've reached the spending limit for this request "
                    f"(${state.get('cost_usd', 0):.4f} used of "
                    f"${state.get('budget_usd', 0):.2f} budget). "
                    "Please start a new session or contact support to increase your limit."
                )
            )
        ]
    }

The graph topology is unchanged from a standard ReAct agent. The only additions are the wrapper at registration time and budget_exceeded as an early-exit node. Your existing call_model and call_tools functions have zero knowledge that budget enforcement exists.
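Because route_after_agent is a plain function, you can check the routing order without running a model. The stdlib-only sketch below copies its logic and uses SimpleNamespace objects as stand-ins for AIMessage, an assumption made for testability. It shows that budget_exhausted wins even when the last message requests a tool, so the tools node is skipped.

```python
from types import SimpleNamespace


def route_after_agent(state: dict) -> str:
    # Same order as the graph's router: the budget check comes first
    if state.get("budget_exhausted"):
        return "budget_exceeded"
    last_message = state["messages"][-1]
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "end"


tool_request = SimpleNamespace(tool_calls=[{"name": "search", "args": {}, "id": "call_1"}])
final_answer = SimpleNamespace(tool_calls=[])

print(route_after_agent({"messages": [tool_request], "budget_exhausted": False}))  # tools
print(route_after_agent({"messages": [tool_request], "budget_exhausted": True}))   # budget_exceeded
print(route_after_agent({"messages": [final_answer], "budget_exhausted": False}))  # end
```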

Running the Graph with Per-User LangGraph Cost Caps

Invoke the compiled graph exactly as you would without budget enforcement:

from langchain_core.messages import HumanMessage

user_id = "user_abc123"
graph = build_graph(user_id)

initial_state = {
    "messages": [HumanMessage(content="Summarize the Q1 earnings report and compare to Q4.")],
    "tokens_used": 0,
    "cost_usd": 0.0,
    "budget_usd": budget_store.get_budget(user_id),  # the budget_store module from above (in-memory or Redis)
    "budget_exhausted": False,
}

result = graph.invoke(initial_state)

print(f"Cost this run: ${result['cost_usd']:.4f}")
print(f"Budget exhausted: {result['budget_exhausted']}")
print(result["messages"][-1].content)

If the user has $0.10 remaining and the run would spend $0.40, the graph executes nodes until the budget is crossed, then routes to budget_exceeded and returns a clean response instead of continuing. The user gets a message, and your bill gets a ceiling. The call that crosses the limit still completes, so a single run can overshoot the remaining budget by at most that one call's cost.
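The budget is checked between calls, not inside them, so the call that crosses the limit still runs and is billed. The sketch below replays the wrapper's check-then-deduct order with made-up per-call costs to show the worst case. simulate_run is a hypothetical helper, and the costs are illustrative, not measured.

```python
def simulate_run(remaining_usd: float, call_costs: list[float]) -> tuple[float, int]:
    """Replay the wrapper's order: check budget, run the call, then deduct.

    Returns (total_spent, calls_made). call_costs holds what each model call
    would cost if it ran.
    """
    spent = 0.0
    calls = 0
    for cost in call_costs:
        if remaining_usd <= 0:                           # pre-call check, as in budget_enforced
            break
        spent += cost                                    # the call runs and is billed in full
        calls += 1
        remaining_usd = max(0.0, remaining_usd - cost)   # deduct, as in budget_store.deduct
    return spent, calls


# $0.10 remaining; the run would make seven calls totalling $0.40
spent, calls = simulate_run(0.10, [0.06] * 6 + [0.04])
print(f"${spent:.2f} spent in {calls} calls")  # $0.12 spent in 2 calls
```

Here a $0.10 budget ends at $0.12 of spend. The second call starts with $0.04 left and costs $0.06, and the third call is never made.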

What Happens When the LangGraph Budget Is Exhausted

There are three things that happen in sequence:

The wrapper adds budget_exhausted: True to the update returned by the node that crossed the limit. The node's own output, including its messages, is still returned, so you do not lose the partial result.
The routing function route_after_agent reads budget_exhausted and returns "budget_exceeded" instead of "tools" or "end".
budget_exceeded_node appends a user-facing explanation and the graph terminates at END.

The agent does not loop again. No additional LLM calls are made. The graph state at termination is fully readable — you can log cost_usd, tokens_used, and the final messages for billing reconciliation.

For streaming graphs (using graph.stream()), the same routing applies. The stream will emit the budget_exceeded node's output as its last event before the stream closes.

Limitations of This In-Process Approach

This pattern works and is production-deployable, but it has real constraints you should know before you ship it:

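Limited persistence across runs. The in-process dict does carry deductions from one run to the next, but only inside a single process: it resets on restart and is not shared between workers. For durable caps across restarts and servers, and for daily or monthly limits, use an external store such as Redis or Postgres with TTL-based or scheduled resets.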

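Race conditions on concurrent requests. If the same user fires two requests simultaneously, both can pass the pre-call check against the same remaining budget before either deducts. Atomic operations (a Redis EVAL Lua script, Postgres SELECT ... FOR UPDATE) keep the stored balance consistent. They do not stop both in-flight calls from running, though, so the overshoot grows with concurrency. Reserving an estimated cost before each call and settling to the actual cost afterwards bounds it.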

Token counts depend on the LLM integration. Not every LangChain model wrapper populates usage_metadata consistently, so test your specific model before relying on it. If usage_metadata is absent, fall back to estimation. Note that tiktoken covers only OpenAI-family tokenizers. For Anthropic models, use Anthropic's token-counting endpoint (client.messages.count_tokens() in the anthropic SDK), or estimate roughly 4 characters per token for English text plus a safety margin.

No cross-graph totals in state. cost_usd and tokens_used describe a single graph invocation. If the same user runs several graphs (e.g., one for chat, one for background tasks), only the budget store sees their combined spend, so every graph and every process must deduct from the same centralized store for the cap to hold.

Tool execution costs are not tracked. The cost loop only reads usage_metadata from AIMessage responses. ToolMessage objects returned by tool nodes carry no token data, so any paid external API calls made inside your tool functions are not counted against the budget — track those separately at the tool level.
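To narrow the concurrency gap, one common approach is to reserve an estimated cost atomically before the call and settle to the actual cost afterwards. Two simultaneous requests then cannot both pass the check against the same balance. The sketch below is an in-process illustration under stated assumptions. ReservingBudget is a hypothetical class, and the caller supplies the estimate (for example, from max_tokens and the prompt size). A production version would run the reserve step as a Redis Lua script or a Postgres transaction.

```python
from threading import Lock


class ReservingBudget:
    """Hypothetical reserve-then-settle budget for a single process."""

    def __init__(self, budget_usd: float) -> None:
        self._available = budget_usd
        self._lock = Lock()

    @property
    def available(self) -> float:
        with self._lock:
            return self._available

    def reserve(self, estimate_usd: float) -> bool:
        """Atomically hold estimate_usd; refuse if the balance cannot cover it."""
        with self._lock:
            if estimate_usd > self._available:
                return False
            self._available -= estimate_usd
            return True

    def settle(self, estimate_usd: float, actual_usd: float) -> None:
        """Replace a reservation with the real cost (refunds any unused part)."""
        with self._lock:
            self._available += estimate_usd - actual_usd


budget = ReservingBudget(0.10)
first = budget.reserve(0.06)    # True: 0.04 left while the call is in flight
second = budget.reserve(0.06)   # False: a concurrent request cannot overdraw
budget.settle(0.06, 0.045)      # the first call cost less; 0.015 is refunded
print(first, second, round(budget.available, 3))  # True False 0.055
```

If a call's actual cost exceeds its estimate, settle lets the balance dip below zero. The remaining overshoot is then bounded by the estimation error rather than by the number of concurrent calls.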

Frequently asked questions

Does this work with async LangGraph graphs?

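Yes, with one change: the wrapper must be async too. The synchronous wrapper above calls node_fn(state) without awaiting it, so an async node would hand back an unawaited coroutine instead of a dict, and the cost accounting would fail. Write an async wrapper that awaits the node, register it with add_node as before, and call await graph.ainvoke(). If your budget store does network I/O, use an async client (redis-py ships redis.asyncio) or run the store calls in a thread so they do not block the event loop.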
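A minimal async version might look like the sketch below. It makes three assumptions. First, the store and pricing functions are passed in as parameters rather than imported from the budget_store and pricing modules, so the example runs standalone. Second, messages are duck-typed on usage_metadata instead of checked against AIMessage. Third, the synchronous store calls run in a worker thread via asyncio.to_thread. budget_enforced_async is a hypothetical name.

```python
import asyncio
from types import SimpleNamespace
from typing import Awaitable, Callable

AsyncNode = Callable[[dict], Awaitable[dict]]


def budget_enforced_async(
    node_fn: AsyncNode,
    user_id: str,
    model: str,
    get_budget: Callable[[str], float],
    deduct: Callable[[str, float], float],
    tokens_to_usd: Callable[[str, int, int], float],
) -> AsyncNode:
    """Async twin of budget_enforced; store and pricing functions are injected."""

    async def wrapper(state: dict) -> dict:
        # Pre-call check, run in a worker thread so a blocking store call
        # (e.g. a synchronous Redis client) does not stall the event loop
        remaining = await asyncio.to_thread(get_budget, user_id)
        if remaining <= 0 or state.get("budget_exhausted", False):
            return {"budget_exhausted": True}

        # Await the async node, then copy its update
        result = dict(await node_fn(state) or {})

        call_cost, call_tokens = 0.0, 0
        for msg in result.get("messages", []):
            usage = getattr(msg, "usage_metadata", None)  # only model replies carry usage
            if usage:
                inp = usage.get("input_tokens", 0)
                out = usage.get("output_tokens", 0)
                call_tokens += inp + out
                call_cost += tokens_to_usd(model, inp, out)

        if call_cost > 0:
            remaining_after = await asyncio.to_thread(deduct, user_id, call_cost)
            result["tokens_used"] = state.get("tokens_used", 0) + call_tokens
            result["cost_usd"] = state.get("cost_usd", 0.0) + call_cost
            if remaining_after <= 0:
                result["budget_exhausted"] = True
        return result

    return wrapper


if __name__ == "__main__":
    balances = {"user_abc123": 0.01}

    def fake_deduct(user_id: str, amount: float) -> float:
        balances[user_id] = max(0.0, balances[user_id] - amount)
        return balances[user_id]

    async def call_model(state: dict) -> dict:
        # Stand-in for an AIMessage with reported usage
        reply = SimpleNamespace(content="hi", usage_metadata={"input_tokens": 1000, "output_tokens": 500})
        return {"messages": [reply]}

    node = budget_enforced_async(
        call_model, "user_abc123", "demo-model",
        lambda uid: balances[uid], fake_deduct,
        lambda model, i, o: i / 1000 * 0.003 + o / 1000 * 0.015,
    )
    update = asyncio.run(node({"tokens_used": 0, "cost_usd": 0.0, "budget_exhausted": False}))
    print(update["tokens_used"], update["budget_exhausted"])  # 1500 True
```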

What happens if the user's budget runs out mid-tool-call?

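The budget is checked before each wrapped node runs and deducted after each model call returns. Suppose the model call that requests a tool is the one that exhausts the budget. route_after_agent checks budget_exhausted before it looks at tool_calls, so the graph routes to budget_exceeded and the requested tool never runs. The check happens only at node boundaries, so a tool that is already executing is not interrupted. Tool nodes add no cost of their own, so the budget can only run out mid-tool because of a concurrent request for the same user. In every case, no further LLM charges accumulate in this run once the limit is crossed.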

How do I reset budgets monthly?

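For the Redis implementation, set the key with a TTL matching your billing cycle: r.set(key, budget_usd, ex=seconds_until_reset). When the key expires, get_budget and deduct fall back to DEFAULT_BUDGET_USD, which acts as the reset. Two details matter here. A plain SET inside the deduct script clears the key's TTL, so use SET ... KEEPTTL (Redis 6.0+). A key that deduct recreates after expiry also has no TTL, so re-arm the expiry at that point or reset keys from a scheduled job. For Postgres, add a reset_at timestamp column and reset rows via a cron job or a scheduled background task.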

Can I use this pattern with multiple LLM providers in the same graph?

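Mostly. The wrapper reads usage_metadata from any AIMessage regardless of provider, and adding each provider's rates to MODEL_PRICING covers pricing. However, budget_enforced prices every message with the single model string it was given. Either wrap each node with the model that node actually calls, or look up the model per message from its response metadata.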
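A sketch of the per-message lookup follows. It rests on two assumptions worth verifying against your integrations. The first is that the model name sits in response_metadata under "model_name" or "model"; the key can differ between integrations. The second is that the reported name may carry a dated snapshot suffix, which is why the lookup picks the longest table key that prefixes it. resolve_model and the trimmed MODEL_PRICING copy are hypothetical.

```python
from types import SimpleNamespace
from typing import Any

# Trimmed copy of the pricing table (USD per 1K tokens: input, output)
MODEL_PRICING: dict[str, tuple[float, float]] = {
    "gpt-4.1": (0.002, 0.008),
    "gpt-4.1-mini": (0.0004, 0.0016),
    "claude-sonnet-4": (0.003, 0.015),
}


def resolve_model(msg: Any, default_model: str) -> str:
    """Map the model name a response reports to a MODEL_PRICING key.

    Assumes the name sits in response_metadata under "model_name" or "model";
    falls back to default_model (the wrapper's model argument) otherwise.
    """
    metadata = getattr(msg, "response_metadata", None) or {}
    reported = metadata.get("model_name") or metadata.get("model") or default_model
    # Longest matching prefix, so "gpt-4.1-mini-..." does not resolve to "gpt-4.1"
    matches = [key for key in MODEL_PRICING if reported.startswith(key)]
    return max(matches, key=len) if matches else default_model


reply = SimpleNamespace(response_metadata={"model_name": "gpt-4.1-mini-2025-04-14"})
print(resolve_model(reply, "claude-sonnet-4"))  # gpt-4.1-mini
```

In the wrapper, you would then price each message with pricing.tokens_to_usd(resolve_model(msg, model), input_tokens, output_tokens) instead of the fixed model argument.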

Conclusion

The pattern above gives you pre-flight budget enforcement inside LangGraph without touching the graph topology. Budget state travels as typed fields, the wrapper is applied at registration time and meters cost as each node runs, and conditional routing exits cleanly when the limit is crossed. The core logic is around 80 lines of Python. The edge cases (concurrent requests, store resets, multi-graph attribution, non-OpenAI tokenizers) add complexity in proportion to your scale, not your initial implementation.

For teams that want per-user enforcement without owning that infrastructure, noburn.dev handles it at the infrastructure layer — tracking spend per user or project, blocking over-budget requests before they reach the model, and providing a dashboard without any state management in your graph. Start for free at noburn.dev/docs.