prompt cachingclaudegpt-4ocost optimization

Prompt Caching in Claude and GPT-4o: Real Cost Savings and How to Use It

Anthropic and OpenAI both ship prompt caching that can cut costs by 50-90% on repeated system prompts and context. Here is what caching actually covers and how to structure prompts to benefit from it.

nb

noburn.dev·2026-06-16

On the first call you see cache_creation_input_tokens populated (you paid the write surcharge). On every call within the TTL you see cache_read_input_tokens, billed at the 0.1x rate. If cache_read_input_tokens stays at zero across repeated calls, your prefix is changing and the cache never matches.

To extend the lifetime from the default 5 minutes to 1 hour, set the TTL explicitly:

Where noburn fits

The tools compared in this article handle observability, routing, or evaluation — all of which operate after the LLM call completes. noburn operates before it. It wraps your existing OpenAI, Anthropic, LangChain, and the Vercel AI SDK client, estimates the token cost of each call, and blocks it if the calling user or project has exceeded their budget. Nothing in this comparison does that at a self-serve price point.

Per-user metering lets you enforce separate limits per end-customer, and Stripe passthrough lets you bill them for their LLM usage without writing a billing layer yourself. The free tier covers 100 requests per month. Documentation and SDKs are at noburn.dev/docs.