noburn.dev
← BlogJoin waitlist
cost controlLLMagents

Why LLM Cost Control Is the Problem Nobody Talks About

Every AI startup obsesses over model quality. Almost none have a plan for when an agent runs away with a $2,000 API bill overnight.

nb
noburn.dev·2026-05-18

Most AI startups have a model quality problem. Some have a latency problem. Almost none have a cost problem — until they do, and then it's the only problem.

The overnight surprise

Here's a pattern that plays out every few months on Hacker News: a solo founder ships an AI feature, goes to bed, wakes up to a $4,000 OpenAI invoice. The agent hit an edge case, entered a retry loop, and nobody was watching. The bill gets paid, the feature gets a hard cap bolted on afterward, and everyone moves on.

The cap is always added after the incident. Never before.

Why this happens

LLM APIs are billed by token — and token counts are not immediately obvious at the call site. When you write:

response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

You have no idea what that line costs until the invoice lands. If messages contains a large conversation history, a document the user pasted, or a prompt that grew unbounded through an agentic loop — you could be spending $0.001 or $0.80 on a single call.

Multiply by concurrent users. Multiply by a retry loop. Multiply by a weekend nobody was watching.

The observability trap

The standard advice is "add observability." Tools like Helicone, Langsmith, and Braintrust will show you exactly what you spent — after the fact. They're useful for understanding your costs. They're useless for preventing a runaway agent.

Observability is a rearview mirror. You need a governor.

What a governor looks like

A governor intercepts before the call fires. It estimates the token cost client-side from the request itself (the model, the messages array, any max_tokens cap), then checks that estimate against a budget. If the budget is exhausted, the call never goes out.

from noburn import BAARRouter

router = BAARRouter(
    client=openai_client,
    budget_usd=10.0,
)

# This raises BudgetExhausted if the estimated cost would exceed $10
response = router.chat.completions.create(
    model="gpt-4o",
    messages=messages,
)

No API call. No tokens sent. No invoice.

The key word is before. The damage from an LLM overspend isn't the tokens you process — it's the tokens you don't catch before they leave your server.

Per-user budgets matter more than project budgets

A project-level budget catches the worst case. A per-user budget catches the realistic case.

If your product has 100 users and one of them triggers a 200-iteration agentic loop, a project budget of $50/day won't save you. That one user will consume your entire budget before anyone else gets a response.

Per-user metering lets you set a reasonable limit per account — say, $1/day — so a single misbehaving session can't burn the runway for everyone else.

The right time to add cost control

Before you launch. Not after your first incident.

Cost control is boring infrastructure. It's not a feature users ask for. It won't show up in your changelog. But it's the difference between a sustainable AI product and one that bleeds money every time someone finds an edge case.

Add the governor now, while it's cheap to do so.