noburn.dev
rag · retrieval augmented generation · llm cost · vector database

The Real Cost of RAG Applications in Production

RAG architectures have two cost centers that naive estimates miss: embedding generation and large context windows at retrieval time. Here is what production RAG actually costs per query.

noburn.dev·2026-06-14

A reranker lets you retrieve broadly for recall, then keep only what earns its place in the prompt. Cutting context from 6,000 to 2,500 tokens on the earlier example takes the query from $0.021 to about $0.010, a 52% reduction, with retrieval quality often improving because the model sees less noise.
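As a rough illustration of the "retrieve broadly, keep only what earns its place" step, here is a minimal sketch that trims reranked chunks to a token budget. The `Chunk` type, the `score` field, and the 4-characters-per-token estimate are all assumptions made for the example; a real pipeline would use the scores from its own reranker and the model's actual tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    score: float  # relevance score from a reranker (hypothetical; higher is better)


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): about 4 characters per token for English text.
    # Swap in the model's real tokenizer for accurate counts.
    return max(1, len(text) // 4)


def trim_to_budget(chunks: list[Chunk], max_context_tokens: int = 2500) -> list[Chunk]:
    """Keep the highest-scoring chunks that fit within the context token budget."""
    kept: list[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > max_context_tokens:
            continue  # this chunk does not fit; a smaller, lower-ranked one still might
        kept.append(chunk)
        used += cost
    return kept
```

The greedy pass favors relevance first and fills leftover budget with smaller chunks. Whether to skip an oversized chunk or stop at the first one that does not fit is a design choice; stopping early keeps the context strictly in rank order at the cost of some unused budget.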

2. Trim the fixed overhead.

The 600-token few-shot block and the 400-token system prompt are paid on every single query, forever. Place stable instructions at the start of the prompt so that the provider's prompt caching, where available, can discount the repeated prefix tokens, and audit whether the few-shot examples are still earning their cost after you have real traffic to fine-tune against.
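To put a number on that fixed overhead, a small sketch of the arithmetic follows. The 1,000 tokens come from the 400-token system prompt plus the 600-token few-shot block; the price per million tokens and the cache discount are placeholder assumptions, not figures from any provider's price list, and any surcharge for writing to the cache is ignored.

```python
def fixed_overhead_cost(
    queries: int,
    overhead_tokens: int = 1000,      # 400-token system prompt + 600-token few-shot block
    price_per_million: float = 3.00,  # assumed input price in USD (placeholder)
    cached_discount: float = 0.0,     # assumed fraction discounted on cached prefix tokens
) -> float:
    """Return the USD cost of the fixed prompt prefix across `queries` calls."""
    effective_price = price_per_million * (1.0 - cached_discount)
    return queries * overhead_tokens * effective_price / 1_000_000
```

Under the placeholder price, one million queries spend $3,000 on the prefix alone before any retrieved context or output is counted. Passing a non-zero `cached_discount` shows how much of that a cached prefix would recover.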

3. Put a budget gate in front of the generation call.

This is the part the architecture diagrams leave out. Before the expensive generation call fires, estimate its token cost and check it against a budget for that user, tenant, or project. If the request would push them over, block it and return a graceful error instead of an answer. This is the only mechanism that bounds the tail.
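The gate described above can be sketched in a few lines. This is a generic illustration, not noburn's API: `BudgetGate`, `guarded_generate`, the single blended price, and the in-memory spend table are all hypothetical. It assumes a worst-case estimate (input tokens plus the maximum output tokens) and records the estimate as spend, where a production version would record the actual usage reported by the provider and keep spend in shared storage.

```python
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a call would push a tenant past its budget."""


class BudgetGate:
    def __init__(self, limits_usd: dict[str, float], price_per_million: float):
        self.limits_usd = limits_usd                # budget per user, tenant, or project
        self.price_per_million = price_per_million  # assumed blended price in USD
        self.spent_usd: dict[str, float] = {}       # in-memory only; not shared across processes

    def estimate_cost(self, input_tokens: int, max_output_tokens: int) -> float:
        # Worst case: assume the model uses its full output allowance.
        return (input_tokens + max_output_tokens) * self.price_per_million / 1_000_000

    def check(self, tenant: str, input_tokens: int, max_output_tokens: int) -> float:
        estimate = self.estimate_cost(input_tokens, max_output_tokens)
        spent = self.spent_usd.get(tenant, 0.0)
        limit = self.limits_usd.get(tenant, 0.0)    # unknown tenants get no budget
        if spent + estimate > limit:
            raise BudgetExceeded(f"{tenant}: {spent + estimate:.4f} > {limit:.4f} USD")
        return estimate

    def record(self, tenant: str, cost_usd: float) -> None:
        self.spent_usd[tenant] = self.spent_usd.get(tenant, 0.0) + cost_usd


def guarded_generate(
    gate: BudgetGate,
    tenant: str,
    input_tokens: int,
    max_output_tokens: int,
    generate: Callable[[], str],
) -> dict:
    """Run `generate` only if the estimated cost fits the tenant's budget."""
    try:
        estimate = gate.check(tenant, input_tokens, max_output_tokens)
    except BudgetExceeded:
        # Graceful error instead of an answer; the expensive call never fires.
        return {"ok": False, "error": "budget exceeded"}
    answer = generate()
    gate.record(tenant, estimate)  # simplification: real code would record actual usage
    return {"ok": True, "answer": answer}
```

The important property is ordering: the check runs before `generate` is invoked, so a blocked request costs nothing. Estimating against the maximum output length is deliberately conservative, since it is the overshoot in the tail that the gate exists to bound.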

Where noburn fits

The tools compared in this article handle observability, routing, or evaluation, all of which operate after the LLM call completes. noburn operates before it. It wraps your existing OpenAI, Anthropic, LangChain, or Vercel AI SDK client, estimates the token cost of each call, and blocks the call if the calling user or project has exceeded their budget. Nothing in this comparison does that at a self-serve price point.

Per-user metering lets you enforce separate limits per end-customer, and Stripe passthrough lets you bill them for their LLM usage without writing a billing layer yourself. The free tier covers 100 requests per month. Documentation and SDKs are at noburn.dev/docs.