Your OpenAI invoice arrived this morning: $4,200 instead of the usual $600. Your traffic logs show the same request volume as last month. Your error logs are clean. One user, one model, one API, but somehow the bill 7x'd overnight. How do you find which call did it?
The first instinct is panic. The second is to check the UI. OpenAI's billing dashboard shows daily spending but not the calls that drove it. You see that May 15 cost $300 and May 16 cost $1,100, but no breakdown of which prompt, which user, or which API call caused the jump. The dashboard is an audit trail, not a debugging tool. It answers "how much?" but not "why?"
This is a structural problem in how API providers expose cost data. They show aggregated spend per day and per model, but most teams need cost per request, per user, and per feature. The gap between what the provider shows and what you need to debug is where cost spikes hide.
Why cost spikes happen and why they are hard to find
Before diving into the debugging process, understand the root causes. Cost spikes in LLM applications come from a few sources:
- Input token amplification: A user submits a 100-page document instead of the usual 2-page request. Input tokens grow roughly 50x, and cost per call grows with them.
- Retry storms: Network issues or rate limits trigger retries with exponential backoff. One logical call becomes 5 API calls. See our deep dive on retry logic.
- Agentic loops: An agent keeps calling tools because it is not satisfied with the first result. The loop that was supposed to run 2-3 times runs 10+. Each iteration is an API call.
- Model selection changes: A silent code deploy switches from gpt-4o-mini ($0.15/$0.60 per million tokens, input/output) to gpt-4o ($2.50/$10 per million). All subsequent calls cost roughly 17x more.
- Batch job leaks: Someone runs a bulk operation on production credentials by accident. 100K rows × $0.05/call = $5K in 20 minutes.
The challenge is that these causes leave different signatures in your logs. An input spike looks different from a retry storm, which looks different from a code change. The debugging process must account for all of them.
The systematic debugging approach
The method requires three inputs: API call logs, user activity logs, and the provider's pricing structure. Most teams have one or two of these. The goal is to cross-reference them to narrow down the culprit.
Step 1: Isolate the date and model
Start with the invoice. OpenAI's usage and cost endpoints (and the equivalents in Anthropic's console or LiteLLM's proxy logs) show aggregated spend per day and per model.
The spike might be concentrated or spread:
- Concentrated spike (single day or single model): The cause is usually one bad call, one misbehaving user, or a code deploy that went wrong. Look for the exact hour the spike began.
- Spread spike (gradual increase over multiple days): The cause is systemic — a prompt template that got longer, a retry count that increased, or a new feature that uses more tokens.
Cross-reference with your error logs. A surge in 429 (rate limit) or 5xx errors correlates with retry storms. A sudden increase in request volume (even if successful) correlates with a feature change or user behavior shift.
For concentrated spikes, note the exact timestamp. This is your search window for the next steps.
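One way to pin down the hour a concentrated spike began is to compare each hour's spend against a trailing baseline. The sketch below assumes you can export hourly cost as (timestamp, dollars) pairs. The function name find_spike_start, the 24-hour baseline window, and the 3x threshold are illustrative choices, not provider defaults.

```python
from datetime import datetime, timedelta
from statistics import median


def find_spike_start(hourly_costs, baseline_hours=24, threshold=3.0):
    """Return the first timestamp whose cost exceeds threshold x the median
    of the preceding baseline_hours, or None if no hour qualifies.

    hourly_costs: list of (datetime, float) sorted by time (assumed input shape).
    """
    for i in range(baseline_hours, len(hourly_costs)):
        window = [cost for _, cost in hourly_costs[i - baseline_hours:i]]
        baseline = median(window)
        ts, cost = hourly_costs[i]
        # Skip hours with a zero baseline so tiny nonzero spend is not flagged.
        if baseline > 0 and cost > threshold * baseline:
            return ts
    return None


# Synthetic data: 30 flat hours at $1, then six hours at $10.
start = datetime(2026, 5, 15)
costs = [(start + timedelta(hours=h), 1.0) for h in range(30)]
costs += [(start + timedelta(hours=30 + h), 10.0) for h in range(6)]
spike_at = find_spike_start(costs)
print(spike_at)
```

The median keeps one or two noisy hours in the baseline from masking the jump. A real baseline would probably also need to account for daily traffic cycles.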
Step 2: Query your request logs for that time window
If you log every LLM API call (and you should), filter by the spike window. Most teams log to a database (Postgres, MongoDB) or a log aggregator (Datadog, CloudWatch, Splunk).
A typical query:
SELECT user_id, model,
       SUM(prompt_tokens + completion_tokens) AS total_tokens,
       COUNT(*) AS num_calls,
       AVG(prompt_tokens + completion_tokens) AS avg_tokens_per_call
FROM llm_calls
WHERE created_at >= '2026-05-15 00:00:00' AND created_at < '2026-05-16 00:00:00'
GROUP BY user_id, model
ORDER BY total_tokens DESC
LIMIT 20;
This shows which user and model consumed the most tokens that day. If one user went from 50K tokens to 5M tokens, you have found your primary culprit.
Now ask: did the user submit more requests (num_calls spike), or fewer requests with longer tokens (avg_tokens_per_call spike)? This tells you what happened:
- Spike in num_calls: More requests. Maybe a loop or a batch operation.
- Spike in avg_tokens_per_call: Longer requests. Maybe a large document or context window change.
- Both: More and longer requests. Likely a retry storm, or an agent loop whose context grows with every iteration.
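The comparison above can be sketched as a small helper that takes the same aggregates for a normal day and for the spike day. The function name classify_spike, the label strings, and the 2x growth factor are assumptions for illustration; the dictionary keys mirror the column aliases in the query.

```python
def classify_spike(baseline, spike, factor=2.0):
    """Label a spike by which aggregate grew: call volume, tokens per call, or both.

    baseline / spike: dicts with 'num_calls' and 'avg_tokens_per_call'.
    factor: how much growth counts as a spike (illustrative default).
    """
    calls_up = spike["num_calls"] >= factor * baseline["num_calls"]
    tokens_up = (
        spike["avg_tokens_per_call"] >= factor * baseline["avg_tokens_per_call"]
    )
    if calls_up and tokens_up:
        return "both"
    if calls_up:
        return "more_calls"
    if tokens_up:
        return "longer_calls"
    return "no_clear_signal"


normal_day = {"num_calls": 50, "avg_tokens_per_call": 1_000}
spike_day = {"num_calls": 55, "avg_tokens_per_call": 90_000}
print(classify_spike(normal_day, spike_day))
```

Here the call count barely moved while tokens per call grew 90x, so the helper reports longer calls, which points toward a large document or a context change rather than a loop.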
Step 3: Drill into that user's individual calls
Once you have narrowed to a user and time window, pull individual calls:
SELECT created_at, model, prompt_tokens, completion_tokens,
       total_cost_estimate, input_text_summary, error_code, retry_count
FROM llm_calls
WHERE user_id = 'user_xyz'
  AND created_at >= '2026-05-15 00:00:00' AND created_at < '2026-05-16 00:00:00'
ORDER BY total_cost_estimate DESC
LIMIT 50;
Look for outliers and patterns:
- Single call with 2M tokens: The model generated a very long output (for example, a streamed response that never hit a stop condition), or the context was assembled incorrectly. Check the input length too, since it might be larger than expected.
- Series of 100 calls with 50K tokens each: An agent loop is firing repeatedly, or a retry storm is amplifying calls.
- Many calls with high retry_count: Retries are doubling or tripling the cost. See how retry logic causes cost blowups.
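- Calls with error_code = 'context_length_exceeded': The prompt was longer than the model supports. The API rejects such requests instead of truncating them, so the failed call itself is normally not billed. The cost comes from what your code does next: retries, or a fallback to a larger-context model.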
This drill-down usually identifies the culprit within a few queries.
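If the rows from the Step 3 query are loaded into Python, the patterns above can be tallied in one pass. This is a rough sketch: summarize_calls, its counter names, and the 100K-token cutoff for an oversized call are hypothetical, and the row keys assume the column names used in that query.

```python
from collections import Counter


def summarize_calls(calls, big_call_tokens=100_000):
    """Count oversized calls, retried calls, and context errors in one user's rows.

    calls: list of dicts with prompt_tokens, completion_tokens, retry_count,
    and error_code keys.
    """
    summary = Counter()
    for call in calls:
        total = call["prompt_tokens"] + call["completion_tokens"]
        if total >= big_call_tokens:
            summary["oversized_calls"] += 1
        if call["retry_count"] > 0:
            summary["retried_calls"] += 1
            summary["extra_attempts"] += call["retry_count"]
        if call["error_code"] == "context_length_exceeded":
            summary["context_errors"] += 1
    return dict(summary)


rows = [
    {"prompt_tokens": 120_000, "completion_tokens": 2_000, "retry_count": 0, "error_code": None},
    {"prompt_tokens": 4_000, "completion_tokens": 1_000, "retry_count": 5, "error_code": None},
    {"prompt_tokens": 4_000, "completion_tokens": 0, "retry_count": 2, "error_code": "context_length_exceeded"},
]
report = summarize_calls(rows)
print(report)
```

A high extra_attempts count relative to the number of rows suggests retries are the multiplier, while a handful of oversized calls points at input size.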
Step 4: Check code changes
If the spike is not user-driven (the user's traffic looks normal but tokens jumped), it is a code change. Check what deployed around the spike timestamp:
git log --since="2026-05-15 00:00:00" --until="2026-05-16 00:00:00" \
  --oneline -- src/llm/ src/agents/ src/prompts/
Look for changes to:
- System prompts: Longer prompts add tokens to every call. A 2K-token system prompt change multiplies across thousands of calls.
- Few-shot examples: More examples = more context = more cost per call.
- Model selection: A silent switch from gpt-4o-mini to gpt-4o multiplies cost by roughly 17x at the prices above.
- Retry logic: Increased retry count or more aggressive backoff.
- Agent definitions: New tools, longer tool descriptions, or loops that run until success instead of first attempt.
- Context window or max_tokens changes: Setting max_tokens: 4000 instead of 1000 raises the ceiling on output tokens, and therefore on output cost, for every call.
A single line change — e.g., adding a 2K-token system prompt to the base agent — can multiply across thousands of calls if deployed on a production agent.
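To judge whether a suspicious commit is large enough to explain the spike, it helps to put a number on it. The sketch below uses the per-million-token prices quoted in this article (check the provider's current price list before relying on them); the helper names and the 50,000-call volume are illustrative.

```python
# USD per million tokens, as quoted in this article (may be out of date).
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
}


def added_prompt_cost(extra_tokens, calls, model):
    """Extra input cost from adding extra_tokens to each of `calls` requests."""
    return extra_tokens * calls / 1_000_000 * PRICES[model]["input"]


def model_switch_multiplier(old, new):
    """Input and output price ratios when traffic moves from `old` to `new`."""
    return (
        PRICES[new]["input"] / PRICES[old]["input"],
        PRICES[new]["output"] / PRICES[old]["output"],
    )


# A 2K-token system prompt added to 50,000 gpt-4o calls.
prompt_delta = added_prompt_cost(2_000, 50_000, "gpt-4o")
# Price ratio for a switch from gpt-4o-mini to gpt-4o.
input_ratio, output_ratio = model_switch_multiplier("gpt-4o-mini", "gpt-4o")
print(prompt_delta, round(input_ratio, 1), round(output_ratio, 1))
```

Under these assumptions the prompt change adds about $250, and the model switch multiplies both input and output cost by about 16.7.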
Step 5: Estimate the math to verify
Once you have identified the cause, estimate the expected cost and cross-check against actual cost:
If a user made 1,000 calls with:
- Average 5,000 prompt tokens (per call)
- Average 3,000 completion tokens (per call)
- Using GPT-4o: $2.50 per million input tokens, $10 per million output tokens
Expected cost:
Input: (1,000 × 5,000) / 1,000,000 × $2.50 = $12.50
Output: (1,000 × 3,000) / 1,000,000 × $10.00 = $30.00
Total: $42.50
If the actual cost is $500 for that user, there is still a roughly 12x gap. Possible causes:
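- Longer outputs than estimated: Check the 95th and 99th percentile output length for that user. If they regularly generate 20K-token responses, your estimate was off.
- More calls than counted: Your logging is incomplete. Calls made by middleware or retries are not being logged.
- Token counting differences: Local tiktoken estimates can differ slightly from the token counts the API bills, for example because of per-message formatting overhead and tool definitions. Streamed responses only report usage if you request it, so logged counts for streams may be estimates.
- Batch or cached pricing: You might be using the Batch API (50% discount) or prompt caching (discounted input rates on repeated prefixes; the discount varies by provider and model). These lower the actual bill, so they explain a gap in the other direction, where actual cost comes in under the estimate.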
Cross-reference all three data sources (call logs, user activity logs, and the pricing table) to find the discrepancy.
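The Step 5 arithmetic is easy to wrap in a function so it can be rerun with different averages. This sketch reproduces the worked example above; estimate_cost is a hypothetical helper, and the prices are passed in so nothing is hard-coded.

```python
def estimate_cost(calls, avg_prompt_tokens, avg_completion_tokens,
                  input_price_per_m, output_price_per_m):
    """Expected USD cost from call count, average token counts, and per-million prices."""
    input_cost = calls * avg_prompt_tokens / 1_000_000 * input_price_per_m
    output_cost = calls * avg_completion_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# The example from this step: 1,000 GPT-4o calls, 5,000 in / 3,000 out tokens each.
expected = estimate_cost(1_000, 5_000, 3_000, 2.50, 10.00)
actual = 500.00  # what the invoice attributes to this user in the example
gap = actual / expected
print(f"expected ${expected:.2f}, actual ${actual:.2f}, gap {gap:.1f}x")
```

A gap near 1x means the identified cause explains the bill. A gap this large means something is still missing from the logs or the estimate.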
Real-world example: The invisible retry storm
A platform built with LangChain was seeing stable costs of $1,500/month on 50K agent calls. Then costs jumped to $6,200/month.
Debugging:
- Date isolation: Spike occurred on June 10 around 2 PM UTC.
- User query: One user (user_789) accounted for 40% of that day's tokens, but only 2% of requests. They had submitted a 400-page document.
- Call drill-down: That user's calls showed high retry_count (average 5 retries per call). The knowledge base had been slow that day, triggering timeouts.
- Code check: No code was deployed that day. The slowness was external.
- Math: 50 logical calls × about 5 attempts each = 250 total LLM calls for that user. At $0.10 per call, that is $25 instead of the expected $5. The same multiplier hit every user whose requests touched the slow knowledge base that day, which accounts for the spike.
The fix was not code. It was better error handling in the knowledge base query to avoid timeouts and excessive retries. See our guide on retry costs for prevention.
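One generic way to keep a slow dependency from multiplying LLM calls is to cap attempts and record how many were used. The wrapper below is a sketch of that idea, not the fix the platform in this example actually shipped. It assumes timeouts surface as Python's built-in TimeoutError; call_with_capped_retries and its defaults are hypothetical.

```python
import time


def call_with_capped_retries(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TimeoutError with exponential backoff.

    Gives up after max_attempts and re-raises the last TimeoutError.
    Returns (result, attempts) so the attempt count can be logged with the call.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(), attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise
            # Wait 0.5s, 1s, 2s, ... between attempts by default.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for a slow knowledge base: times out twice, then answers.
state = {"calls": 0}


def flaky_lookup():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("knowledge base timed out")
    return "ok"


# The sleep is stubbed out so the example runs instantly.
result, attempts = call_with_capped_retries(flaky_lookup, sleep=lambda s: None)
print(result, attempts)
```

Logging the returned attempt count next to each call is what makes a retry storm visible in the Step 3 query.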
Tools to streamline debugging
Manual queries work but are slow. Consider:
- Spreadsheet exports: Export usage for the spike date from the provider dashboard, then use pivot tables to group by model and estimate per-call cost.
- Observability platforms: LangSmith and Langfuse log every LLM call with token counts and trace trees. Both have cost filters.
- Custom dashboards: Query your logs into a Grafana or Datadog dashboard grouped by user, model, hour, and cost. This makes spikes obvious.
- Alerts: Set up alerts for cost per user per day or cost per call. Alert when a single call exceeds $0.50 or $1.00, depending on your workload.
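The alerting rule in the last item can be prototyped against exported call logs before wiring it into a monitoring system. In this sketch the function name, the record shape (user_id, day, cost), and the $25 per-user daily limit are assumptions; the $0.50 per-call limit comes from the text above.

```python
from collections import defaultdict


def find_cost_alerts(calls, per_call_limit=0.50, per_user_daily_limit=25.00):
    """Return alert strings for calls and user-days that exceed the given limits.

    calls: iterable of dicts with user_id, day (e.g. '2026-05-15'), and cost in USD.
    """
    alerts = []
    user_day_totals = defaultdict(float)
    for call in calls:
        if call["cost"] > per_call_limit:
            alerts.append(f"call by {call['user_id']} cost ${call['cost']:.2f}")
        user_day_totals[(call["user_id"], call["day"])] += call["cost"]
    for (user_id, day), total in sorted(user_day_totals.items()):
        if total > per_user_daily_limit:
            alerts.append(f"{user_id} spent ${total:.2f} on {day}")
    return alerts


# One expensive call from user_a, and 120 cheap calls from user_b.
log = [{"user_id": "user_a", "day": "2026-05-15", "cost": 0.75}]
log += [{"user_id": "user_b", "day": "2026-05-15", "cost": 0.25} for _ in range(120)]
alerts = find_cost_alerts(log)
for line in alerts:
    print(line)
```

Checking both limits matters because the two failure modes look different: user_a trips the per-call rule with a single request, while user_b stays under it on every call and only shows up in the daily total.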
The best teams have cost visibility baked in, not bolted on after the spike.
Prevention: why debugging is too late
Debugging finds the culprit after you overspend. Prevention stops the overspend from happening.
As we cover in our LangChain cost control guide, the best approach is to set per-user, per-project, or per-feature budgets and enforce them before calls fire. This means:
- No call can exceed the user's remaining budget
- Retries are blocked if they would exceed budget
- Large documents or agentic loops are capped
- The spike is prevented, not debugged
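Independent of any particular tool, the enforcement idea can be sketched as a check that runs before each API call. The BudgetGuard class below is a hypothetical, in-memory illustration and does not represent any product's API. It assumes you can estimate a call's cost up front, for example from the prompt token count and max_tokens.

```python
class BudgetExceeded(Exception):
    """Raised when a call's estimated cost would exceed the remaining budget."""


class BudgetGuard:
    """Track per-user spend in memory and refuse calls that would overspend."""

    def __init__(self, limits):
        self.limits = dict(limits)  # user_id -> budget in USD
        self.spent = {}             # user_id -> USD reserved so far

    def remaining(self, user_id):
        # Users without a configured limit get a budget of zero.
        return self.limits.get(user_id, 0.0) - self.spent.get(user_id, 0.0)

    def charge(self, user_id, estimated_cost):
        """Reserve estimated_cost before the API call, or raise BudgetExceeded."""
        if estimated_cost > self.remaining(user_id):
            raise BudgetExceeded(
                f"{user_id} has ${self.remaining(user_id):.2f} left, "
                f"call needs ${estimated_cost:.2f}"
            )
        self.spent[user_id] = self.spent.get(user_id, 0.0) + estimated_cost


guard = BudgetGuard({"user_xyz": 1.00})
guard.charge("user_xyz", 0.75)      # allowed: $0.25 remains afterwards
try:
    guard.charge("user_xyz", 0.50)  # blocked: would exceed the remaining $0.25
    blocked = False
except BudgetExceeded:
    blocked = True
print(guard.remaining("user_xyz"), blocked)
```

Because retries go through the same charge call, a retry that would push a user over budget is refused like any other request. A production version would need shared, persistent storage and reconciliation against actual usage.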
noburn integrates with your OpenAI, Anthropic, or LangChain client and blocks calls that would exceed budget before they hit the API. This eliminates the debugging cycle entirely because the overspend never happens.
The free tier covers 50,000 requests per month. For details, see noburn.dev/docs.
Key takeaways
- Cost spikes hide in aggregated data. Drill down to user and call level.
- Look for signatures: input spikes, retry storms, code changes, or agentic loops. Each has a different cause and requires a different fix.
- Token counting differences matter. Local tiktoken estimates and the token counts the API bills can differ slightly. Account for this in your estimates.
- Cross-reference logs, code changes, and error patterns. One source rarely tells the whole story.
- Prevention beats debugging. Enforce budgets before calls fire, not after.