How Retry Logic Turns Small LLM Errors Into Large Bills

Exponential backoff is standard practice for HTTP APIs. For LLM APIs it can be a billing disaster. Here is how retry patterns interact with token costs and what to do instead.

noburn.dev·2026-06-02

You deploy a chatbot powered by GPT-4o. A user asks a question. The API returns a 429 (rate limit error). Your code retries. The second attempt succeeds. You log it, move on. One logical request turned into two API calls, but the overhead seems small — both are the same prompt, so the cost doubled from $0.10 to $0.20. That's fine, right?

That is fine until rate limits spike at midnight, and 10% of your requests hit 429. One in ten becomes two API calls. A 50K-call night becomes 55K calls. Your LLM bill went up 10% overnight, and your retry logic is invisible — the code looks normal, the error logs show expected 429s, but the cost multiplied. This is the subtle part: retries are not just a reliability pattern. They are a cost amplification pattern.

The problem compounds when retries are nested. An agent calls a tool. The tool makes an LLM call. The LLM call times out and retries. Exponential backoff on both the agent retry and the inner call means the cost of a single agent step can multiply by 5 or 10x before the agent ever sees a final result.

This guide walks through how retry logic interacts with token costs, the most expensive failure modes, and the architectural patterns that prevent runaway bills.

The hidden cost of exponential backoff

Exponential backoff is industry standard: wait 1 second, then 2, then 4, then 8, then 16. The theory is sound: if the API is overloaded, backing off reduces load. The flaw is that it does not change the token cost.

A request retried after 16 seconds costs exactly as much as one retried after 1 second. Backoff delay changes when you pay, not whether you pay. For a $0.10 call that takes 5 attempts, you pay $0.50 whether the backoff starts at 1 second or 16 seconds.
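A small sketch makes the point concrete. The function names and the use of integer cents are illustrative assumptions, not part of any SDK, and the example is read as five billed attempts at $0.10 each. The schedule and the bill are computed separately to show that the delay never enters the cost.

```python
def backoff_schedule(attempts: int, base_seconds: float = 1.0) -> list:
    """Delays before each retry: base, 2x base, 4x base, ... (none before the first attempt)."""
    return [base_seconds * 2 ** i for i in range(attempts - 1)]


def total_cost_cents(attempts: int, cost_per_call_cents: int) -> int:
    """Every attempt is assumed to be billed in full; the delay is not an input."""
    return attempts * cost_per_call_cents


# Five attempts at 10 cents each, with two very different schedules.
fast = backoff_schedule(5, base_seconds=1.0)    # [1.0, 2.0, 4.0, 8.0]
slow = backoff_schedule(5, base_seconds=16.0)   # [16.0, 32.0, 64.0, 128.0]

print(sum(fast), sum(slow))       # total wait differs: 15.0 240.0
print(total_cost_cents(5, 10))    # the bill does not: 50
```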

The real cost problem is frequency. If 10% of your traffic hits a transient error and you retry up to 5 times on each, you are not making 55K calls on a 50K night. In the worst case you are making 50K + 5K × 5 = 75K, because each retry is a full round-trip: prompt tokens, completion tokens, full billing.

Consider a realistic scenario:

Normal night: 50K calls, all succeed first try. Cost: $5,000 (at $0.10/call average).
High-load night with retries: 50K requests, 10% hit rate limits, each taking 3 attempts on average (assuming every attempt is billed).
- 45K requests succeed first try: 45K calls, $4,500 cost
- 5K requests hit 429 and retry: 5K × 3 attempts = 15K calls, $1,500 cost
- Total: 60K calls, $6,000 cost (a 20% increase)

That 20% increase is not a mistake or a known risk. It is a silent leak in the cost model.
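The scenario's arithmetic can be checked with a few lines. This is a sketch under two assumptions: the 3 is treated as total attempts per affected request (the reading that yields the 60K total), and every attempt is billed in full. Whether a rejected attempt is actually billed depends on the provider and the failure type. The function name is made up for this example.

```python
def night_totals(requests: int, failure_rate: float,
                 attempts_when_failing: int, cost_per_call_cents: int) -> tuple:
    """Return (total_calls, total_cost_cents), assuming every attempt is billed."""
    failing = round(requests * failure_rate)
    clean = requests - failing
    total_calls = clean + failing * attempts_when_failing
    return total_calls, total_calls * cost_per_call_cents


baseline_calls, baseline_cost = night_totals(50_000, 0.0, 1, 10)
busy_calls, busy_cost = night_totals(50_000, 0.10, 3, 10)

print(baseline_calls, baseline_cost / 100)      # 50000 5000.0
print(busy_calls, busy_cost / 100)              # 60000 6000.0
print(f"{busy_cost / baseline_cost - 1:.0%}")   # 20%
```

Changing `failure_rate` or `attempts_when_failing` shows how quickly the overhead grows: the extra spend scales with the product of the two.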

Why the standard approaches fail

Approach 1: Lower the retry count

The first instinct when the bill arrives is to drop retries from 5 to 3. This helps linearly and misses the structural problem: three retries at three levels (agent, tool, LLM call) means 3 × 3 × 3 = 27 possible attempts. Tuning one number in one layer does not address the multiplication.
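The multiplication is easy to reproduce. The following simulation is a sketch with made-up names: a call that always fails is wrapped in three independent retry loops of three attempts each, and a counter records how many times the innermost call actually runs.

```python
class Failing(Exception):
    """Stands in for a timeout or other transient error."""


counter = {"calls": 0}


def raw_llm_call():
    counter["calls"] += 1            # each run here would be a billed API call
    raise Failing("simulated timeout")


def with_retries(fn, attempts=3):
    """Wrap fn in a plain retry loop that gives up after `attempts` tries."""
    def wrapped():
        last_error = None
        for _ in range(attempts):
            try:
                return fn()
            except Failing as exc:
                last_error = exc
        raise last_error
    return wrapped


# Three layers, each unaware of the others.
llm_client = with_retries(raw_llm_call)   # SDK-level retries
tool = with_retries(llm_client)           # tool-level retries
agent = with_retries(tool)                # agent-level retries

try:
    agent()
except Failing:
    pass

print(counter["calls"])  # 27
```

Dropping any single layer from 3 attempts to 2 still leaves 18 calls, which is why tuning one number in one place has limited effect.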

Approach 2: Add longer backoff delays

The second instinct is to add backoff: if the API is overloaded, wait longer before retrying. Backoff delay reduces request rate, which is good for API health. But it does not reduce cost. A request retried after 16 seconds costs the same as one retried after 1 second. For rate-limit storms, longer backoff can actually help because fewer concurrent retries hit the API at the same time. But the per-request token bill is unchanged.
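One common way to keep retries from arriving at the API in synchronized waves is to randomize the delay, often called full jitter. The sketch below assumes a base of 1 second and a cap of 60 seconds; both values and the function name are illustrative. It spreads load, and like any backoff it leaves the per-request token cost untouched.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: Optional[random.Random] = None) -> float:
    """Pick a random delay between 0 and the capped exponential ceiling."""
    if rng is None:
        rng = random.Random()
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0.0, ceiling)


rng = random.Random(42)  # seeded only so the demo is repeatable
delays = [full_jitter_delay(attempt, rng=rng) for attempt in range(8)]
ceilings = [min(60.0, 2.0 ** attempt) for attempt in range(8)]

# Every delay stays within its ceiling: 1, 2, 4, ..., capped at 60 seconds.
print(all(0.0 <= d <= c for d, c in zip(delays, ceilings)))  # True
```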

Approach 3: Retry only on timeout

The third instinct is to catch the timeout specifically and retry only on timeout. This is closer to correct, but it ignores the most expensive failure mode: a streaming request that completes generation on the server and then times out on the client. The server has generated and billed every output token. Your client saw a timeout, treated it as failure, and retried. The server generated and billed the full output a second time. Without an idempotency key, you are billed twice for output you used once.

Approach 4: Observability and alerts

The fourth instinct is to add a dashboard and alert on cost. Log the cost of each API call, visualize retries, and alert when cost per call exceeds a threshold. This is genuinely useful for debugging and every team should do it. But it is detection, not prevention. The dashboard tells you about the amplification after the tokens are spent. The Slack alert the next morning is the system working as designed. The money is already gone.
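For the detection side, a minimal tracker might look like the sketch below. The class name, the `alert_factor` threshold of 2.0, and the idea of keying on a logical request ID are assumptions for illustration. It reports how many billed attempts were made per logical request, which is the number a plain per-call cost log hides.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class RetryCostTracker:
    alert_factor: float = 2.0  # alert when average attempts per request exceed this
    attempts: dict = field(default_factory=lambda: defaultdict(int))
    cost_cents: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, request_id: str, cost_cents: float) -> None:
        """Call once per API attempt, using the same ID for every retry of a request."""
        self.attempts[request_id] += 1
        self.cost_cents[request_id] += cost_cents

    def amplification(self) -> float:
        """Billed attempts divided by logical requests (1.0 means no retries)."""
        if not self.attempts:
            return 1.0
        return sum(self.attempts.values()) / len(self.attempts)

    def should_alert(self) -> bool:
        return self.amplification() > self.alert_factor


tracker = RetryCostTracker()
tracker.record("req-1", 10)          # succeeded first try
for _ in range(5):
    tracker.record("req-2", 10)      # five billed attempts for one request

print(tracker.amplification())       # 3.0
print(tracker.should_alert())        # True
```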

The architecture that prevents runaway costs

Four changes, in order of impact.

1. Classify errors before retrying

The single most important rule: never retry a request whose failure is a property of the input rather than the transport. A 400 (malformed request), a 422 (validation error), a content-policy refusal, and a downstream validation failure will all reproduce on an identical prompt. Retrying them is wasted spend.

Retry only on:

Connection errors — network timeouts, socket resets
Proven incomplete requests — timeouts where you can prove the API did not complete the request
429 — rate limit
5xx — server error

Do not retry on:

4xx (except 429) — client errors; retrying will not fix them
Content policy violations — retrying the exact same prompt will hit the same violation
Validation errors — downstream service rejected the input; retrying will not fix it

import time

def should_retry(error_code: str, error_message: str,
                 proven_unbilled: bool = False) -> bool:
    """Classify errors; retry only transient failures."""
    # Do not retry client errors
    if error_code in ('400', '401', '403', '404', '422'):
        return False
    # Do not retry policy violations or validation errors
    if any(phrase in error_message.lower() for phrase in
           ('content_policy', 'violat', 'invalid', 'malformed')):
        return False
    # Retry transient errors only
    if error_code in ('429', '500', '502', '503', '504'):
        return True
    # Retry a timeout only if the caller can prove no tokens were consumed
    # (how you prove that, e.g. a usage-log lookup, is deployment-specific)
    if error_code == 'timeout' and proven_unbilled:
        return True
    return False

# Usage (client and max_retries are assumed to be defined elsewhere)
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(...)
        break
    except Exception as e:
        # HTTP errors expose an integer status_code; timeouts do not
        if isinstance(e, TimeoutError):
            code = 'timeout'
        else:
            code = str(getattr(e, 'status_code', ''))
        if not should_retry(code, str(e)) or attempt == max_retries - 1:
            raise  # Not retryable, or out of attempts: propagate immediately
        time.sleep(2 ** attempt)  # Exponential backoff for transient errors

This single change, classifying errors before retrying, can cut retry costs in half when roughly half of your failures are input errors that would fail again on every attempt.

2. Use idempotency keys to prevent double-billing

The most expensive retry scenario is a streaming request that completes generation on the server but times out on the client before the full response arrives. Here is the timeline:

Client sends request to API
API generates full response (1000 tokens) and bills $0.10
API starts streaming response back to client
Network issue or client timeout occurs mid-stream (client received 600 tokens)
Client sees timeout, assumes failure, and retries
API generates the same response again (1000 tokens) and bills $0.10 again
You have paid $0.20 for output you received once

Without an idempotency key, this happens silently. Where a provider, or an API gateway in front of it, honors idempotency keys, sending the same key with a retry returns the cached result without re-running the model. Support varies by provider and endpoint, so confirm it in the provider's documentation before relying on it.

import time
import uuid
from openai import OpenAI, APITimeoutError

# max_retries=0 turns off the SDK's built-in retries so this loop is the only retry layer
client = OpenAI(max_retries=0)

def call_with_retry(user_prompt: str, max_retries: int = 3) -> str:
    idempotency_key = str(uuid.uuid4())  # Generate once per logical request

    for attempt in range(max_retries):
        try:
            # Send the same idempotency key with every attempt.
            # Assumes the endpoint (or a gateway in front of it) honors this header.
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": user_prompt}],
                extra_headers={"Idempotency-Key": idempotency_key}
            )
            return response.choices[0].message.content
        except APITimeoutError:  # The SDK's timeout error, not the built-in TimeoutError
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)
                continue  # Retry with same key
            raise

result = call_with_retry("Explain quantum computing")

With idempotency keys, retries return the cached first result. One request, one bill, regardless of retries.
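When the provider does not honor the key, the same idea can be applied one hop earlier, for example in a gateway or service that sits between your application and the provider. The sketch below is an in-memory illustration with hypothetical names (`IdempotentCaller`, `fake_model`). It only helps when this layer received the complete response and the timeout happened downstream of it, and a real deployment would need a shared store with expiry instead of a dict.

```python
import uuid
from typing import Callable, Dict


class IdempotentCaller:
    """Remember completed results by key so a retry never re-runs finished work."""

    def __init__(self, call_fn: Callable[[str], str]):
        self._call_fn = call_fn
        self._results: Dict[str, str] = {}

    def call(self, key: str, prompt: str) -> str:
        if key in self._results:
            return self._results[key]      # completed earlier: no second model run
        result = self._call_fn(prompt)     # only reached once per key on success
        self._results[key] = result
        return result


invocations = {"count": 0}


def fake_model(prompt: str) -> str:
    """Stand-in for a billed LLM call."""
    invocations["count"] += 1
    return f"answer to: {prompt}"


caller = IdempotentCaller(fake_model)
key = str(uuid.uuid4())                    # one key per logical request
first = caller.call(key, "Explain quantum computing")
retry = caller.call(key, "Explain quantum computing")  # e.g. after a client timeout

print(invocations["count"])                # 1
print(first == retry)                      # True
```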

3. Cap total retry cost, not retry count

Instead of retrying a fixed number of times (e.g., 5 attempts), set a maximum total cost for all attempts combined. If a single call costs $0.50 and you allow 5 retries, you could spend $3.00 on one logical request. With a $2.00 cap, the loop stops after the fourth attempt no matter how many retries remain.

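import time
from typing import Any, Callable, Tuple

class BudgetExceeded(Exception):
    """Raised when the next retry would exceed the remaining budget."""

def call_with_cost_limit(
    call_fn: Callable[[], Tuple[Any, float]],
    user_id: str,
    remaining_budget_cents: float,
    initial_call_cost_cents: float,
    max_retries: int = 5
) -> Any:
    """Retry a call, but stop if total retry cost exceeds budget."""
    spent = initial_call_cost_cents  # The first attempt has already been billed
    # Estimate each retry from the first call's cost (usually similar)
    estimated_next_cost = initial_call_cost_cents * 1.2  # Conservative estimate

    for attempt in range(max_retries):
        # Check the budget before the retry fires, not after
        if spent + estimated_next_cost > remaining_budget_cents:
            raise BudgetExceeded(
                f"User {user_id} has ${spent/100:.2f} spent on retries; "
                f"next retry would cost ${estimated_next_cost/100:.2f}"
            )
        try:
            result, cost = call_fn()  # Call returns (result, cost_cents)
            return result
        except Exception as e:
            # should_retry is the classifier from step 1
            if not should_retry(str(getattr(e, "status_code", "")), str(e)):
                raise
            if attempt == max_retries - 1:
                raise  # Out of attempts
            spent += estimated_next_cost  # Assume the failed retry was billed
            time.sleep(2 ** attempt)

# Usage (llm_api_call is your own function returning (result, cost_cents))
budget_remaining_cents = 500  # User has $5 remaining this month
call_with_cost_limit(
    call_fn=lambda: llm_api_call(),
    user_id="user_123",
    remaining_budget_cents=budget_remaining_cents,
    initial_call_cost_cents=50  # First attempt cost $0.50
)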

This ensures retries never push a user past their remaining budget, no matter how many attempts are still allowed.
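A cost cap is only as good as the estimate behind it. One way to get an upper bound before a call fires is to price the full prompt plus the output cap, as in the sketch below. The function names are made up for this example, and the prices are placeholders chosen for round numbers, not any provider's actual rates.

```python
def estimate_cost_cents(prompt_tokens: int, max_output_tokens: int,
                        input_cents_per_mtok: float,
                        output_cents_per_mtok: float) -> float:
    """Upper-bound cost of one attempt: the whole prompt plus the output cap."""
    input_cost = prompt_tokens * input_cents_per_mtok / 1_000_000
    output_cost = max_output_tokens * output_cents_per_mtok / 1_000_000
    return input_cost + output_cost


def can_afford(spent_cents: float, next_cents: float, budget_cents: float) -> bool:
    """True if one more attempt fits inside the remaining budget."""
    return spent_cents + next_cents <= budget_cents


# Placeholder prices: 250 cents per million input tokens, 1000 per million output tokens.
per_attempt = estimate_cost_cents(20_000, 500, 250, 1000)
print(per_attempt)                            # 5.5

print(can_afford(5.5, per_attempt, 12.0))     # True: a second attempt fits
print(can_afford(11.0, per_attempt, 12.0))    # False: a third would overshoot
```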

4. Prefer batch API for non-urgent work

If the request does not need a response within seconds, a batch API reduces cost by 50% and is retry-friendly because you are not holding open a connection that can time out. You submit a batch, the API processes it asynchronously, and you check for results later. If a job fails, you resubmit it at the discounted batch rate instead of retrying individual calls at full price.

Batch API is ideal for:

Document summarization (100s of documents)
Classification (large label sets)
Embedding generation (millions of vectors)
Scheduled report generation

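import json
from openai import OpenAI

client = OpenAI()

# Prepare batch requests (documents is a list of strings defined elsewhere)
batch_requests = [
    {
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": f"Summarize: {doc}"}],
            "max_tokens": 500,
        }
    }
    for i, doc in enumerate(documents[:10000])  # Batch of 10K documents
]

# OpenAI's Batch API takes an uploaded JSONL file, one request per line
# (file format varies by provider)
with open("batch_input.jsonl", "w") as f:
    for request in batch_requests:
        f.write(json.dumps(request) + "\n")

with open("batch_input.jsonl", "rb") as f:
    input_file = client.files.create(file=f, purpose="batch")

# Submit batch (50% discount on input/output tokens)
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Check results later
status = client.batches.retrieve(batch.id)
if status.status == "completed":
    # Output is a JSONL file with one result per line, matched by custom_id
    results = client.files.content(status.output_file_id).text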

Batch API costs 50% of on-demand pricing. A request that costs $0.10 on demand costs $0.05 in a batch. For bulk work, this is the cheapest option.

Real-world example: The retry cost that went unnoticed

A team had a LangChain agent for customer support automation. The agent calls a knowledge base tool, which calls the OpenAI API. Both have retry logic.

The knowledge base was slow one morning (database replication lag). The agent's tool call timed out. The agent retried. The tool was still slow. The tool's internal LLM call timed out. The LLM retried. Both with exponential backoff.

One customer query resulted in:

Agent attempt 1 → tool call 1 → LLM call with 3 retries = 3 LLM calls
Agent attempt 2 → tool call 1 → LLM call with 3 retries = 3 LLM calls
Agent attempt 3 → tool call 1 → LLM call with 3 retries = 3 LLM calls
Total: 9 LLM calls for one query

Without retries, the query is one LLM call at $0.50. With the retry loop it became nine calls and cost $4.50. The customer got a helpful response. The cost was invisible in logs.

By end of the day, with 5,000 customer queries hitting this path: 5,000 × 9 = 45K LLM calls instead of 5K. Cost: $22,500 instead of $2,500. The retry logic cost an extra $20,000 that day.

The fix was three-part:

Classify errors — only retry timeouts and 429s, not all errors
Add idempotency keys — prevent double-billing on retries
Cap retry budget — stop retrying if cost exceeds remaining budget

After the fix, retries were rare and cheap.
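One way to enforce a cap across nested layers is to share a single attempt budget for the whole logical request, so the agent, the tool, and the LLM client all draw from the same pool. The sketch below is an illustration with hypothetical names, not the team's actual code. When the pool is empty, the leaf raises an error that no layer treats as retryable, so it propagates straight to the top.

```python
class Transient(Exception):
    """A retryable failure, such as a timeout."""


class BudgetExhausted(Exception):
    """Not retryable: no retry loop below catches it."""


class RetryBudget:
    """One pool of attempts shared by every layer of a logical request."""

    def __init__(self, max_attempts: int):
        self.remaining = max_attempts

    def take(self) -> bool:
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


def retrying(fn, attempts=3):
    """Plain retry loop that only retries Transient errors."""
    def wrapped():
        last_error = None
        for _ in range(attempts):
            try:
                return fn()
            except Transient as exc:
                last_error = exc
        raise last_error
    return wrapped


budget = RetryBudget(max_attempts=4)
counter = {"calls": 0}


def llm_call():
    if not budget.take():
        raise BudgetExhausted("attempt budget for this request is spent")
    counter["calls"] += 1               # a billed call in a real system
    raise Transient("simulated timeout")


agent = retrying(retrying(llm_call))    # two layers: 3 x 3 = 9 calls without a budget

try:
    agent()
    outcome = "succeeded"
except BudgetExhausted:
    outcome = "stopped by budget"
except Transient:
    outcome = "ran out of attempts"

print(counter["calls"], outcome)        # 4 stopped by budget
```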

Tools and integrations

Built-in retry handling

OpenAI Python SDK — Automatic retries with exponential backoff. Configure via max_retries parameter.
Anthropic Python SDK — Built-in retry logic. Handles 429 and 5xx by default.
LangChain — BaseCallbackHandler for custom retry logic. Or use the max_retries parameter on LLM objects.

Cost-aware retry libraries

Tenacity — General-purpose retry library with custom conditions. Can define retry conditions based on exception type.
Backoff — Simpler alternative to Tenacity. Good for one-off retry logic.
noburn — Cost-aware retry enforcement. Blocks retries if they would exceed user budget. Integrates with OpenAI, Anthropic, LangChain, and LiteLLM.

Where noburn fits

noburn prevents retry cost blowups by blocking retries before they fire. If a user has $10 remaining in their monthly budget, and a retry would cost $15, noburn blocks it. No retry, no cost.

It works by wrapping your LLM client:

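from noburn import with_noburn_budget
from langchain_openai import ChatOpenAI  # Current home of ChatOpenAI
# Assumed import: the two-argument call below matches langgraph's prebuilt agent
from langgraph.prebuilt import create_react_agent

model = with_noburn_budget(
    ChatOpenAI(model="gpt-4o"),
    user_id="user_123",
    monthly_budget_cents=50000  # $500/month
)

# tools is your list of LangChain tools, defined elsewhere
agent = create_react_agent(model, tools)
# Retries that would exceed budget are blocked before the API call fires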

The free tier covers 50,000 requests per month. Per-request pricing starts at $0.001. Documentation and SDKs are at noburn.dev/docs.

Key takeaways

Exponential backoff reduces request rate but not token cost. Retries double or triple spending even if backoff delays are long.
Nested retries (agent + tool + LLM) multiply costs exponentially. A 3-layer retry can cause 27x amplification in the worst case.
Classify errors before retrying. Retrying a 400 or content-policy error is wasted spend — it will fail the same way.
Use idempotency keys, where the provider or gateway supports them, to prevent double-billing on timeouts. One request = one bill, even if it retries 10 times.
Cap retry cost, not retry count. Set a maximum total budget for all retries, not a fixed number of attempts.
For bulk work, use batch API. 50% cost reduction with no retries needed.