budget capLLMcost controlagentstutorial

How to Set a Hard Budget Cap on LLM API Calls in 2026

Three approaches to capping LLM spend before it happens — client-side limits, gateway enforcement, and pre-flight cost estimation. What each one prevents and where each breaks.

nb

noburn.dev·2026-05-24

The standard advice when an AI agent runs up a surprise bill is: "add a budget cap." This is correct. It is also underspecified. There are three meaningfully different ways to cap LLM spend, and they protect you against different failure modes.

The short answer: max_tokens prevents single-call explosions. Post-hoc alerts catch prolonged overspend. Pre-flight enforcement is the only approach that blocks cost at scale — especially for agents. The rest of this post explains why, with code.

Understanding the distinction matters most for agent workloads, where the failure mode is autonomous code making hundreds of calls without human review. A cap that looks solid against a normal chatbot can be bypassed by an agent that retries on errors, spawns subagents, or processes unexpectedly large documents.

The Three Approaches

1. Client-Side Token Limits

The simplest approach is passing max_tokens in the API call. Every major LLM provider supports this parameter. It caps the response length for a single call.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    max_tokens=1000,  # hard cap on output tokens
)

What this protects: runaway output generation on a single call. If your prompt triggers an unexpectedly verbose response, max_tokens truncates it at 1,000 tokens regardless.

What this does not protect:

Repeated calls. If your agent makes 500 calls with max_tokens=1000, the cumulative cost is 500 × 1,000 output tokens × model rate. The per-call cap does nothing to the total.
Input token cost. max_tokens only limits output. If you're sending large documents as context, the input cost isn't bounded.
Per-user limits. A single max_tokens value applies to every call equally — you can't set different caps for different users without additional logic.

Client-side token limits are a necessary baseline. They are not a cost enforcement mechanism.

2. Post-Hoc Spend Tracking with Alerts

Most observability tools — Helicone, Portkey (Production tier), LangSmith — provide dashboards that show you what you've spent. You can configure an alert when spend crosses a threshold, which triggers a webhook or email notification.

What this protects: prolonged overspend that you catch within the alert latency window. If your agent starts generating $20/day instead of $2/day, you get alerted within the polling interval and can disable the agent manually.

What this does not protect:

Fast bursts. An alert fires after the fact. If the agent loops 200 times in 4 minutes, the damage is done before your webhook handler runs.
Automated enforcement. An alert notifies a human. Shutting off the agent requires a separate manual action unless you build your own response pipeline.
Per-user granularity at self-serve pricing. Most observability tools attribute cost to API keys or teams, not to individual end-users of your product. Per-user enforcement typically requires enterprise tier pricing — see the breakdown in Helicone vs Portkey.

Post-hoc tracking is essential for visibility. It is not enforcement. The call has already been sent.

3. Pre-Flight Budget Enforcement

Pre-flight enforcement evaluates each call before it is sent. The enforcement layer:

Estimates the input token count for the pending request
Looks up the remaining budget for the requesting user or project
Computes estimated cost = input tokens × model rate
If remaining budget ≥ estimated cost → forwards the call
If remaining budget < estimated cost → rejects the call with a structured error

The API call never leaves your infrastructure if the budget is exhausted. The user has not spent anything.

import Noburn from '@noburn/sdk';

const nb = new Noburn({ apiKey: process.env.NOBURN_API_KEY });

// Pre-flight check — runs before the LLM call
const check = await nb.check({
  projectId: 'proj_abc',
  userId: 'user_123',
  model: 'gpt-4o',
  estimatedTokens: countTokens(messages),
});

if (!check.allowed) {
  return { error: 'Budget limit reached', remaining: check.remaining };
}

// Budget approved — make the call
const response = await openai.chat.completions.create({ model: 'gpt-4o', messages });

// Record actual cost after completion
await nb.record({
  projectId: 'proj_abc',
  userId: 'user_123',
  model: 'gpt-4o',
  inputTokens: response.usage.prompt_tokens,
  outputTokens: response.usage.completion_tokens,
});

What this protects:

Budget exhaustion. The call is blocked before any tokens are consumed.
Per-user limits. Each user has a budget. A high-consumption user cannot spend another user's allocation.
Fast bursts. Even if the agent makes 200 calls in 4 minutes, each one checks the remaining balance. The budget is depleted, then enforcement kicks in.
Cross-session cumulative spend. The remaining budget persists across sessions, not just within a single conversation.

What this requires:

Token estimation before the call (approximate but reliable within 5-10% for most models)
A spend ledger maintained somewhere — your database, or a managed service
Latency overhead for the pre-flight check (typically 2-15ms for a local database lookup)

Which Approach Covers Which Failure Mode

Failure mode	Client-side `max_tokens`	Post-hoc alerts	Pre-flight enforcement
Runaway single response	✓ Blocks	Detects after	✓ Blocks
Agent retry loop (100 calls)	✗ Misses	Detects after	✓ Blocks
Large document context (500k input tokens)	✗ Misses	Detects after	✓ Blocks
Per-user budget isolation	✗ Misses	Partial	✓ Enforces
Simultaneous multi-user overrun	✗ Misses	Detects after	✓ Enforces
3am autonomous agent spike	✗ Misses	Detects after (slow)	✓ Blocks

The pattern is clear: client-side limits protect against single-call runaway. Post-hoc tracking detects problems after money is spent. Pre-flight enforcement is the only mechanism that actually blocks cost at scale.

Implementing Pre-Flight Enforcement Without a Managed Service

If you're building this yourself, the core components are:

Token counting. Most model providers expose tokenizer libraries. For OpenAI models, tiktoken is the reference implementation — it uses the same BPE encoding as the API, so estimates are reliable within 1-2%. For Anthropic, use their Python library's token count method. Current model pricing is published at platform.openai.com/docs/pricing. The estimate doesn't need to be perfect — you're estimating before the call, so a 5% margin is acceptable.

import tiktoken

def estimate_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    total = 0
    for msg in messages:
        # 4 tokens overhead per message in the OpenAI format
        total += 4 + len(enc.encode(msg["content"]))
    return total

Spend ledger. A simple Postgres table works. You need: user_id, current_spend (decimal), budget_limit (decimal), reset_at (timestamp for monthly resets). Use a transaction that reads current_spend + updates it atomically to prevent race conditions under concurrent calls.

-- Check and reserve atomically
UPDATE user_budgets
SET current_spend = current_spend + $estimated_cost
WHERE user_id = $user_id
  AND current_spend + $estimated_cost <= budget_limit
RETURNING user_id, current_spend, budget_limit;
-- 0 rows affected = budget exceeded

Reconciliation. The estimated cost will differ from the actual cost. After each call completes, you have the real token counts from usage.prompt_tokens and usage.completion_tokens. Update the ledger with the actual cost and release or charge the difference.

The complexity grows fast. Handling model pricing changes, supporting multiple providers with different token rates, managing budget resets, providing a spend dashboard, generating invoices for end-users — each of these is a distinct engineering project. For a full breakdown of what building this stack actually involves, see Per-User LLM Billing: The Gap Nobody Has Filled.

A Common Mistake: Confusing Rate Limiting with Budget Enforcement

Rate limiting rejects calls when a count threshold is crossed. Budget enforcement rejects calls when a spending threshold is crossed. They sound similar but behave very differently.

A rate limit of 100 requests/day costs very different amounts depending on model selection. 100 calls to gpt-4o with 10k context each costs roughly $50/day. 100 calls to gpt-3.5-turbo with the same context costs roughly $0.30/day.

Rate limits are useful for preventing abuse and protecting system stability. They are not a cost control mechanism unless you can accurately predict the cost per request — which agents generally make impossible.

Frequently Asked Questions

What is a hard budget cap for LLM API calls?

A hard budget cap stops API calls from being sent once a spending threshold is crossed. Unlike a soft cap (which alerts you after the fact), a hard cap rejects the request before it leaves your server — meaning no tokens are consumed and no cost is incurred. Hard caps require pre-flight enforcement: estimating call cost before sending and checking remaining budget.

Does max_tokens act as a budget cap?

No. max_tokens limits the length of a single response but has no effect on total cumulative spend across multiple calls. An agent that makes 1,000 calls with max_tokens=500 can still run up a significant bill. max_tokens is a per-call output limiter, not a cost enforcement mechanism.

How do I set a per-user budget cap for my AI SaaS?

You need three things working together: (1) a spend ledger that tracks each user's cumulative spend, (2) a pre-flight check that queries the ledger before each LLM call and rejects calls that would exceed the limit, and (3) a reconciliation step that updates the ledger with actual token counts after each call completes. Building this in-house takes roughly 2-3 weeks. For self-serve managed enforcement, noburn.dev provides all three as a drop-in SDK.

What is the difference between rate limiting and budget enforcement in LLM tools?

Rate limiting rejects calls when a request count threshold is crossed (e.g. 1,000 calls per day). Budget enforcement rejects calls when a dollar threshold is crossed. Rate limits don't account for variable call costs — a rate-limited plan can still run up unpredictable spend if expensive models or large contexts are used. Budget enforcement is the appropriate mechanism for cost control; rate limiting is the appropriate mechanism for abuse prevention.

Which observability tools support hard budget caps at self-serve pricing?

As of 2026, neither Helicone (maintenance mode) nor Portkey (enforcement is Enterprise-only) enforces hard budget caps at self-serve price points. LangSmith and Braintrust don't offer enforcement at any tier. For pre-flight enforcement without enterprise pricing, see noburn.dev.

Summary

All three approaches have their place in a production AI application:

max_tokens on every call — baseline. Prevents single-call output explosions.
Post-hoc tracking + alerts — visibility layer. Detects anomalies, provides audit trail.
Pre-flight enforcement — enforcement layer. Blocks spend before it happens.

For deterministic workloads with predictable token counts, post-hoc tracking with fast alerting is often sufficient. For agent workloads — any code that runs autonomously, makes multiple calls, or processes variable-size inputs — pre-flight enforcement is the only mechanism that actually prevents the overage.

The bad news: pre-flight enforcement is the hardest to build and the rarest to find as a self-serve product. The good news: you don't have to build it yourself.