The invoice from OpenAI arrived on the first of the month. The founder had 80 active users on a chat feature that had been live for six weeks. She expected around $400. The invoice said $4,100. Nothing had crashed. No abuse, no scraper, no accidental infinite loop. The feature had worked exactly as designed, for exactly the users it was built for. The problem was the design.
LLM cost per active user varies by two orders of magnitude depending on what you're building and how you build it. A poorly designed chat feature and a well-designed RAG search can look identical to the end user and cost $15/month versus $0.15/month per active user. The mechanisms behind that spread are specific and worth understanding before you scale.
The actual cost structure by feature type
Here are real cost breakdowns for five common AI feature types. Calculations use GPT-4o at $2.50/M input tokens and $10/M output tokens (mid-2026 estimate), and GPT-4o-mini at $0.15/M input and $0.60/M output (mid-2026 estimate). Claude Sonnet runs approximately $3/M input and $15/M output. Verify current rates at each platform's pricing page, as these have shifted multiple times in the past year.
Inline autocomplete. A user writes code and the IDE triggers completions on every keystroke pause. At 100 triggers per day with a 450-token average context and 80-token completion: with GPT-4o-mini, input tokens cost about $0.007/day per user, roughly $0.20/month, and the completions add about $0.14/month on top. Swap in GPT-4o for code quality and the same usage pattern runs about $3.40/month in input plus $2.40/month in output. Both prices scale by the same factor, so the model tier decision alone produces a roughly 17x cost difference at identical usage volume.
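The autocomplete arithmetic can be reproduced with a few lines of Python. This is a minimal sketch: the prices are the estimates quoted above rather than live rates, and `monthly_cost` is an illustrative helper, not part of any SDK.

```python
# USD per million tokens, copied from the estimates quoted above (not live rates).
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def monthly_cost(model, calls_per_day, input_tokens, output_tokens, days=30):
    """Return (input_usd, output_usd) per user for one month of usage."""
    price = PRICES[model]
    calls = calls_per_day * days
    input_usd = calls * input_tokens * price["input"] / 1_000_000
    output_usd = calls * output_tokens * price["output"] / 1_000_000
    return input_usd, output_usd

# Autocomplete scenario: 100 triggers/day, 450-token context, 80-token completion.
for model in PRICES:
    inp, out = monthly_cost(model, calls_per_day=100, input_tokens=450, output_tokens=80)
    print(f"{model}: input ${inp:.2f} + output ${out:.2f} = ${inp + out:.2f}/month")
```

Keeping input and output separate makes it obvious which side of the bill a design change affects: trimming context moves the first number, shortening completions moves the second.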
RAG question-answering. A user queries a knowledge base. Vector retrieval is near-free; the LLM call carries retrieved context plus the question. A typical chunk configuration sends 3,000 tokens of retrieved context plus 500 tokens of prompt overhead. At 15 queries per day, that is 52,500 input tokens daily: with GPT-4o-mini, about $0.008/day or $0.24/month in input cost; with GPT-4o, about $0.13/day or $3.94/month. Short answers add a little on top. The key variable is chunk count. Retrieve 10 chunks instead of 3 and input cost scales roughly in proportion.
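To see how retrieval count moves the bill, here is a rough sketch that prices only the input side of a RAG call. The 1,000-tokens-per-chunk figure is an assumption inferred from the 3-chunk, 3,000-token configuration above, and the function and parameter names are illustrative.

```python
GPT_4O_MINI_INPUT_USD_PER_M = 0.15  # the article's estimate, not a live rate

def rag_monthly_input_cost(chunks, tokens_per_chunk=1_000, prompt_overhead=500,
                           queries_per_day=15, days=30,
                           usd_per_m=GPT_4O_MINI_INPUT_USD_PER_M):
    """Monthly input cost per user for a RAG feature, ignoring output tokens."""
    tokens_per_query = chunks * tokens_per_chunk + prompt_overhead
    monthly_tokens = tokens_per_query * queries_per_day * days
    return monthly_tokens * usd_per_m / 1_000_000

for chunks in (3, 5, 10):
    print(f"{chunks:>2} chunks: ${rag_monthly_input_cost(chunks):.2f}/user/month")
```

Under these assumptions, going from 3 to 10 chunks triples the input bill, which is why retrieval depth deserves the same scrutiny as model choice.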
Chat with conversation history. This is where the $4,100 invoice comes from. A standard chat implementation appends each message to a growing history array and sends the full array with every call. Message 1 sends 500 tokens. Message 30 in that same conversation sends roughly 30 times the average message length as history. The cost per call grows linearly with conversation depth, so the cost of a whole conversation grows with the square of its length, and users who stay in the same conversation thread for days can hit token counts that look like errors.
For a user with 20 conversations per month, each 30 turns deep at 300 tokens per message average, the token cost accumulates quickly. Turn k of a conversation sends k×300 tokens of history, so a 30-turn conversation totals 300×(1+2+…+30) = 300×465 = 139,500, or about 140k, input tokens. At 20 conversations: roughly 2.8M input tokens per user per month on GPT-4o, which is $7/month. Let one user run 200 conversations and the cost for that single user hits $70 before a single output token is counted.
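The history arithmetic is easy to check in code. The sketch below assumes, as the text does, that turn k resends all k messages at a flat 300 tokens each; the function names are illustrative.

```python
def conversation_input_tokens(turns, tokens_per_message=300):
    """Total input tokens when turn k resends all k messages so far."""
    # Sum of k * tokens_per_message for k = 1..turns, using n(n+1)/2.
    return tokens_per_message * turns * (turns + 1) // 2

def monthly_history_cost(conversations, turns, usd_per_m_input=2.50):
    """Input-only monthly cost per user; 2.50 is the article's GPT-4o estimate."""
    tokens = conversations * conversation_input_tokens(turns)
    return tokens * usd_per_m_input / 1_000_000

print(conversation_input_tokens(30))   # 139500
print(conversation_input_tokens(60))   # 549000
print(f"${monthly_history_cost(20, 30):.3f} per user per month at 20 conversations")
```

Doubling a conversation from 30 to 60 turns nearly quadruples its input tokens, which is the quadratic growth that makes long-lived threads so expensive.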
Document summarization. A user uploads a PDF. A 40-page business document runs about 20,000 tokens. With Claude Sonnet: 20,000 input tokens at $3/M (mid-2026 estimate) is $0.06, plus a 1,000-token summary at $15/M (mid-2026 estimate) is $0.015, totaling $0.075 per document. At 10 documents per month: $0.75/user. At 100 documents, which a power user in a legal or finance workflow can reach: $7.50/user.
Autonomous agents. An agent that books meetings, processes invoices, or writes and executes code makes multiple LLM calls per task. Each call carries prior tool outputs in context, so per-call token counts are high and they multiply across steps. A 20-step task with 8,000 input tokens and 1,000 output tokens per call on GPT-4o: 160k input ($0.40) plus 20k output ($0.20) equals $0.60 per task. Ten tasks per month comes to $6/user. Fifty tasks, which is a realistic number for someone using an agent daily, comes to $30/user.
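The 8,000-token figure above is an average; in practice each step tends to carry more context than the one before. A minimal sketch of that growth follows, with a hypothetical 1,700-token base prompt and 600 tokens of tool output added per step, values chosen only so that a 20-step task averages 8,000 input tokens per call.

```python
GPT_4O = {"input": 2.50, "output": 10.00}  # USD per million tokens, article's estimate

def agent_task_cost(steps, base_context=1_700, growth_per_step=600,
                    output_per_call=1_000, price=GPT_4O):
    """Cost of one agent task when every step appends tool output to the context."""
    # Step k sends the base prompt plus k steps' worth of accumulated tool output.
    input_tokens = sum(base_context + growth_per_step * k for k in range(1, steps + 1))
    output_tokens = steps * output_per_call
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

print(f"20-step task: ${agent_task_cost(20):.2f}")
print(f"40-step task: ${agent_task_cost(40):.2f}")
```

With these assumed numbers the 20-step task matches the $0.60 worked out above, while a 40-step task costs three times as much rather than twice, because the later steps carry the largest contexts.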
Why the spread exists: three decisions drive almost all of it
Model tier. Frontier models cost 15-20x what small models cost per token. For features where quality matters but output length is short (classification, extraction, intent routing), small models often deliver 90% of the quality at 5% of the cost. The team that benchmarks GPT-4o-mini on their specific task before defaulting to GPT-4o saves significant margin.
Context accumulation. Every token in a prompt that is not necessary to answer the question is waste at scale. Conversation history accumulates silently. Retrieved chunks multiply with retrieval count. Agent tool outputs grow with task complexity. Prompt length is not a constant; it is a function of user behavior. Users will behave in ways you did not model.
User distribution, not average usage. Average cost per user is a misleading number. If 80 active users include 5 power users who generate 60% of the tokens, the p95 cost per user is what determines whether your economics hold. A feature that costs $0.30/month for a casual user can cost $15/month for a heavy user. Designing around the average and pricing for the average is how you end up with invoices that are 10x what you modeled.
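A quick way to see why the average misleads is to compute it next to the median and the 95th percentile. The cost figures below are made up for illustration (75 casual users at $2 and 5 power users at $45, chosen so the top five account for 60% of spend); only the standard-library `statistics` module is used.

```python
import statistics

# Hypothetical monthly cost per user in USD: 75 casual users and 5 power users.
costs = [2.00] * 75 + [45.00] * 5

mean = statistics.mean(costs)
median = statistics.median(costs)
p95 = statistics.quantiles(costs, n=20)[-1]   # last of 19 cut points = 95th percentile
top5_share = sum(sorted(costs)[-5:]) / sum(costs)

print(f"mean ${mean:.2f}, median ${median:.2f}, p95 ${p95:.2f}")
print(f"top 5 users account for {top5_share:.0%} of spend")
```

In this toy distribution the mean is more than double the median and the p95 is roughly ten times the mean, so a price set from the mean would lose money on exactly the users who generate most of the bill.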
Why standard approaches fail
Request-count rate limits do not capture cost. At GPT-4o rates, one API call can cost $0.001 if the context is a few hundred tokens or more than $0.12 if the user pasted a 50,000-token document, a spread of over 100x. Counting calls rather than tokens will not protect you from cost spikes.
Post-hoc monitoring tools such as Helicone, LangSmith, and Portkey log cost after the call executes. This is useful for debugging and optimization, but it cannot stop a call that has already fired. If a user runs an agent task that costs $8 and they were supposed to be capped at $5/month, the monitoring tool tells you afterward.
OpenAI account-level spend limits apply to the account as a whole, not to the individual end users of your product. With 1,000 users sharing one API key, a monthly account cap of $500 means your five heaviest users can exhaust the entire budget before the other 995 make a single call.
Per-user limits enforced in feature code work only if every code path that triggers an LLM call includes the enforcement check. A new feature added by a new engineer that skips the budget check bypasses the entire system silently. Enforcement in middleware, applied before the call reaches the model, is more reliable than enforcement scattered across feature implementations.
The correct architecture
Pre-flight enforcement is the only mechanism that prevents overspend rather than measuring it after the fact.
The pattern: before executing any LLM call, estimate the token cost of the request from the model, input token count, and expected output length. Check that estimate against the user's remaining budget. Block the request if it would exceed the limit.
This requires a few things to work reliably: a consistent token estimation function that accounts for both the prompt you construct and the output you expect; a per-user budget ledger that updates after every call completes; and a single enforcement point in the request path that every feature routes through. The last part is the hardest. A project that enforces budgets inside each feature handler will drift as the codebase grows. New endpoints skip the check. Refactors break the accounting. The enforcement logic belongs at the proxy or middleware layer, not in feature code.
The estimated-token approach introduces a small margin of error because output length is not known before the call. In practice, over- and under-estimates average out across calls, and for budget enforcement the relevant question is whether the user is within a reasonable range of their limit, not whether cost is tracked to six decimal places. A 10% estimation buffer handles the variance.
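The pieces described above (an estimator, a per-user ledger, and a single enforcement point) can be sketched in a few dozen lines. Everything here is illustrative: the four-characters-per-token heuristic is a crude stand-in for a real tokenizer, `call_model` is a hypothetical wrapper that returns the response text and its actual cost, and the in-memory ledger would need to be a shared store with atomic updates in a real deployment.

```python
from dataclasses import dataclass, field

PRICES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}  # USD per M tokens (input, output)
BUFFER = 1.10  # the 10% estimation buffer mentioned above

class BudgetExceeded(Exception):
    """Raised before the model is called when the estimate exceeds the budget."""

def estimate_cost(model, prompt, expected_output_tokens):
    """Pre-call estimate in USD; ~4 characters per token is a rough heuristic."""
    in_price, out_price = PRICES[model]
    input_tokens = len(prompt) / 4
    raw = (input_tokens * in_price + expected_output_tokens * out_price) / 1_000_000
    return raw * BUFFER

@dataclass
class Ledger:
    limit_usd: float
    spent: dict = field(default_factory=dict)

    def remaining(self, user_id):
        return self.limit_usd - self.spent.get(user_id, 0.0)

    def record(self, user_id, usd):
        self.spent[user_id] = self.spent.get(user_id, 0.0) + usd

def guarded_call(ledger, user_id, model, prompt, expected_output_tokens, call_model):
    """Single enforcement point: estimate, check, call, then settle the ledger."""
    estimate = estimate_cost(model, prompt, expected_output_tokens)
    if estimate > ledger.remaining(user_id):
        raise BudgetExceeded(f"{user_id}: estimate ${estimate:.4f} exceeds remaining budget")
    text, actual_usd = call_model(model, prompt)   # hypothetical client wrapper
    ledger.record(user_id, actual_usd)             # settle with the real cost, not the estimate
    return text

def fake_call_model(model, prompt):   # stand-in for a real client
    return "ok", 0.008                # (response text, actual cost in USD)

ledger = Ledger(limit_usd=0.01)
prompt = "x" * 4_000                  # about 1,000 tokens under the heuristic
guarded_call(ledger, "user-1", "gpt-4o", prompt, 500, fake_call_model)   # within budget
try:
    guarded_call(ledger, "user-1", "gpt-4o", prompt, 500, fake_call_model)
except BudgetExceeded as exc:
    print("blocked:", exc)
```

The check uses the buffered estimate, but the ledger records the actual cost after the call, so estimation error does not compound from one request to the next.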
Frequently asked questions
Why does conversation history cause such large cost spikes?
Because cost grows with conversation depth, not just message count. Turn k of a conversation resends all k messages so far as context. A 30-turn conversation doesn't cost 30× one turn; it costs 1+2+…+30 = 465 message-lengths' worth of tokens, about 15× as much. Users who leave long threads open for days have a qualitatively different cost profile from users who start fresh sessions.
Can rate limits substitute for per-user token budgets?
No. Rate limits count API calls, not tokens. One request with a 50,000-token document costs as much as 50 requests of 1,000 tokens each. A per-user token or dollar budget is the only mechanism that accurately reflects inference cost regardless of how the workload is structured.
Should I charge per-seat or per-usage for AI features?
Per-seat pricing transfers token-cost risk entirely to you. A power user on a flat plan who generates 10× average token volume erodes margin with no recourse. Usage-based or hybrid pricing (seat fee plus token overage) shifts that risk back proportionally. The practical path: start per-seat, collect real usage data for 3-6 months, identify your power user distribution, then move to hybrid before scaling past 200-300 customers.
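The difference between flat and hybrid pricing shows up as soon as a power user appears. The sketch below uses entirely hypothetical numbers (a $20 seat, a blended inference cost of $5 per million tokens, 2M included tokens, and $8 per million in overage) to compare margin per user under both plans.

```python
def margin(usage_m_tokens, seat_fee=20.0, cost_per_m=5.0,
           included_m_tokens=None, overage_per_m=0.0):
    """Monthly margin per user in USD; every figure here is hypothetical."""
    revenue = seat_fee
    if included_m_tokens is not None:
        # Hybrid plan: bill only the usage above the included allowance.
        revenue += max(0.0, usage_m_tokens - included_m_tokens) * overage_per_m
    return revenue - usage_m_tokens * cost_per_m

for usage in (1, 10):  # an average user vs a 10x power user, in millions of tokens
    flat = margin(usage)
    hybrid = margin(usage, included_m_tokens=2, overage_per_m=8.0)
    print(f"{usage:>2}M tokens: flat ${flat:+.2f}, hybrid ${hybrid:+.2f}")
```

With these assumed figures the average user yields the same margin on either plan, while the 10x user turns a $30 loss on the flat plan into a positive margin on the hybrid one.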
Conclusion
Pull the five categories together and the range is clear. Inline autocomplete with a tuned small model stays under $0.50/user/month for typical usage, as does RAG search when sized correctly. Document summarization scales with document volume and can reach $7-8/user/month for heavy workflows. Conversational chat without history management is the category with the widest tail: median users may cost $2-3/month while active users who keep long threads open push past $15-20/month. Agents are the most variable of all: a team member who runs an agent daily for a full month can generate costs in the $50-100 range before you notice the trend.
The founder who got the $4,100 invoice wasn't hit by a bug. She was hit by a design that let users accumulate unbounded conversation history, and by five power users who stayed in threads for weeks. Understanding the cost structure by feature type is the first step. The second step is enforcement: knowing what each feature should cost per user and blocking calls that would exceed that ceiling before they fire.
For teams that need hard per-user spend caps without building that infrastructure, noburn.dev sets a ceiling per user, per run, or per project, and blocks before the model is called.