noburn.dev

LLM Pricing Trends in 2026: What Token Costs Look Like After 18 Months of Competition

LLM API pricing 2026: per-token cost benchmarks across OpenAI, Anthropic, Google, and Deepseek. See where prices stand and where they are likely to go.

Editorial·2026-05-29

The LLM API market has split into at least three distinct tiers: frontier models for complex reasoning tasks, efficient mid-tier models that handle most production workloads, and a commodity inference layer built on open-weight models. GPT-4 launched in early 2023 at $30 per million input tokens. By mid-2026, a model outperforming it on most benchmarks costs under $3. That compression happened because Google, Anthropic, Mistral, Deepseek, and a wave of open-weight inference providers each had structural reasons to cut prices fast. The result is a fragmented market where choosing a model is now as much a cost-engineering decision as a capability one, and where the spread between the cheapest and most expensive options at similar quality levels has narrowed to a degree that makes provider lock-in harder to justify.

OpenAI: tiered pricing across four distinct segments

OpenAI has settled on a four-tier structure that spans roughly two orders of magnitude in per-token cost: from $0.10 to $10 per million input tokens, and from $0.40 to $40 per million output tokens. At the top, the o-series reasoning models price output in the $40 per million tokens range, a number justified by compute-heavy chain-of-thought generation rather than raw capability alone. These are the models you reach for when a wrong answer is expensive, not when throughput matters. As of mid-2026, o3 is priced at $10 per million input tokens and $40 per million output tokens, while o4-mini brings reasoning capability to a much lower price point at $1.10/$4.40 per million tokens.

One tier below, the GPT-4.1 family is OpenAI's general-purpose flagship lineup. GPT-4.1 is priced at $2/$8 per million input/output tokens, GPT-4.1 mini sits at $0.40/$1.60, and GPT-4.1 nano pushes the floor to $0.10/$0.40. The spread between GPT-4.1 and GPT-4.1 nano on a per-output-token basis is 20x, which is a deliberate product decision: OpenAI wants the efficient tier cheap enough to dominate high-volume inference while keeping premium margins on complex reasoning work.

The practical implication for builders is that OpenAI's pricing now requires you to segment your workload explicitly. Routing everything through the same model is leaving money on the table in one direction or capability on the table in the other.
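To make the segmentation point concrete, here is a minimal sketch that compares sending all traffic to GPT-4.1 against splitting the same traffic across tiers. The prices are the figures quoted above; the workload segments, their token volumes, the tier assignments, and the lowercase model labels are hypothetical and exist only for illustration.

```python
# USD per 1M tokens as (input, output), taken from the prices quoted above.
PRICES = {
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
    "o3": (10.00, 40.00),
}


def cost_usd(model, input_tokens, output_tokens):
    """Cost in USD of a batch of traffic sent to one model."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical monthly workload: (segment, input tokens, output tokens, routed model).
WORKLOAD = [
    ("classification", 800_000_000, 40_000_000, "gpt-4.1-nano"),
    ("support chat", 150_000_000, 50_000_000, "gpt-4.1-mini"),
    ("contract review", 50_000_000, 10_000_000, "o3"),
]

# Same traffic, two strategies: one model for everything vs. a model per segment.
single_model = sum(cost_usd("gpt-4.1", i, o) for _, i, o, _ in WORKLOAD)
routed = sum(cost_usd(m, i, o) for _, i, o, m in WORKLOAD)

print(f"everything on gpt-4.1: ${single_model:,.2f}")
print(f"routed by segment:     ${routed:,.2f}")
```

With these made-up volumes the routed plan comes to $1,136 against $2,800 for the single-model plan, even though the contract-review segment costs more on o3 than it would on GPT-4.1. Both effects show up at once: the bulk segments get cheaper while the high-stakes segment gets a stronger model.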

Anthropic API pricing 2026: compressing the capability-to-cost curve

Anthropic's most significant pricing signal has been moving capability downmarket rather than protecting premium model margins. The Claude 4 generation carries that pattern forward. Claude Opus 4, the frontier tier, is priced at $15/$75 per million input/output tokens — matching the price of Claude 3 Opus while delivering substantially stronger performance. Claude Sonnet 4.5 (model names subject to Anthropic's final release naming) sits at $3/$15 per million tokens, and Claude Haiku 4.5 brings the efficient tier down to $0.80/$4. Each step down the lineup represents roughly a 4–5x reduction in per-token cost with a predictable capability tradeoff.

The practical implication is that Anthropic now has credible options at every price point from high-volume summarization to deep reasoning work. For multi-turn applications with large context windows, Anthropic's prompt caching (which prices cached input tokens at a fraction of uncached) becomes a meaningful lever, effectively reducing real-world costs well below the headline per-token rate for conversation-heavy workloads.
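A rough way to see how caching moves the real input price is to blend the cached and uncached rates by hit fraction. The sketch below assumes a cached-read price expressed as a ratio of the base input price. The 0.1 ratio and the 80% hit fraction are placeholders rather than quoted rates, and any surcharge for writing to the cache is ignored, so check the provider's current pricing before relying on the result.

```python
def effective_input_price(base_price, cache_hit_fraction, cached_price_ratio):
    """Blended USD per 1M input tokens when part of the prompt is read from cache.

    base_price          -- uncached input price, USD per 1M tokens
    cache_hit_fraction  -- share of input tokens served from cache (0..1)
    cached_price_ratio  -- cached price as a fraction of base_price (assumption)
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cached = cache_hit_fraction * base_price * cached_price_ratio
    uncached = (1.0 - cache_hit_fraction) * base_price
    return cached + uncached


# $3.00 is the Sonnet-tier input price from the text; the other two values are placeholders.
blended = effective_input_price(3.00, cache_hit_fraction=0.8, cached_price_ratio=0.1)
print(f"blended input price: ${blended:.2f} per 1M tokens")
```

Under those placeholder values the blended input price works out to $0.84 per million tokens instead of $3.00, which is why conversation-heavy workloads with a stable prefix can land well under the headline rate.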

Google Gemini API pricing 2026: subsidized inference at scale

Google's approach to LLM pricing makes more sense when you consider that Gemini inference is both a standalone API product and infrastructure for Google Search, Workspace, and Android. That means Google has structural incentives to push inference costs down that no pure-play AI lab shares. The Gemini 2.5 generation, current as of mid-2026, continues this pattern. Gemini 2.5 Flash is priced at $0.30 per million input tokens and $2.50 per million output tokens — aggressive for a frontier-class model. Gemini 2.5 Pro sits at $1.25 per million input tokens (for prompts up to 200k tokens) and $10 per million output tokens, making it price-competitive with Claude Sonnet 4.5 and GPT-4.1 while offering an extended context window.

The practical differentiator for Google isn't price alone but context window size: 1M-token contexts at Flash pricing change the economics for document-heavy workloads where chunking and retrieval were previously the only viable approach. Flash's output rate of $2.50 per million tokens is low enough to make it a serious contender for any workload where latency and throughput matter more than peak capability.
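The long-context trade-off is easy to put numbers on. The sketch below prices one question answered two ways at the Flash rates quoted above: once with the whole document set in the prompt, once with a small set of retrieved chunks. The token counts are hypothetical, and the comparison ignores the cost of building and running a retrieval pipeline.

```python
FLASH_INPUT_PER_MTOK = 0.30   # USD per 1M input tokens, from the text
FLASH_OUTPUT_PER_MTOK = 2.50  # USD per 1M output tokens, from the text


def request_cost(input_tokens, output_tokens):
    """USD cost of a single request at the Flash rates above."""
    return (input_tokens * FLASH_INPUT_PER_MTOK
            + output_tokens * FLASH_OUTPUT_PER_MTOK) / 1_000_000


# Hypothetical: a 900k-token document set in the prompt, 1k-token answer.
full_context = request_cost(900_000, 1_000)
# Hypothetical: the same question answered from 8k tokens of retrieved chunks.
retrieved = request_cost(8_000, 1_000)

print(f"full context: ${full_context:.4f} per query")
print(f"retrieval:    ${retrieved:.4f} per query")
```

Retrieval is still far cheaper per query, but a full-context query at roughly $0.27 is cheap enough that, at low query volumes, skipping the retrieval pipeline altogether becomes a reasonable engineering choice.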

Open-weight models and the commodity inference layer

Deepseek's December 2024 releases changed the conversation about where the price floor sits. Deepseek V3 was available via their API at approximately $0.14 per million input tokens (cache miss) and $0.28 per million output tokens, with cached input tokens an order of magnitude cheaper. Deepseek R1 brought competitive reasoning-model capability at roughly $0.55/$2.19 input/output, compared to OpenAI o3's $10/$40. A Chinese lab had produced models that benchmarked comparably to frontier Western models and priced them as though compute were nearly free.

Meta's Llama 3.x family, available via Together AI, Fireworks, Groq, and a dozen other inference providers, created a parallel commodity layer. The 70B model runs at prices in the $0.10-0.20 per million token range depending on the provider and SLA tier. Self-hosting on GPU rental platforms (Lambda Labs, Vast.ai, RunPod) pushes effective per-token costs lower still for high-volume use cases, though operational complexity shifts to the engineering team. Groq's LPU hardware introduced ultra-low latency inference at competitive prices, adding a third dimension beyond cost and quality.

Mistral's models occupy a middle ground: European-hosted, GDPR-relevant for some markets, with Mistral Large 2 priced around $3/$9 per million tokens. The open-weight models (Mistral 7B, Mixtral) are freely runnable, making Mistral's commercial API more of a managed convenience than the only access path.

The cumulative effect is that there is now a functioning spot market for inference. The same capability is available at wildly different prices depending on who you route to, and the engineering work to switch providers on a per-request or per-workload basis is decreasing as LiteLLM and similar routing layers mature.
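The spot-market idea can be sketched as a selection function: given a request size and a set of models that have passed your own evaluation, pick the cheapest offer whose context window fits. The prices and context sizes are figures from this article; the `Offer` type, the `cheapest_offer` helper, and the lowercase model labels are hypothetical names, not any routing library's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    model: str              # illustrative label, not an official model ID
    input_per_mtok: float   # USD per 1M input tokens
    output_per_mtok: float  # USD per 1M output tokens
    context_tokens: int     # maximum context window


# A few offers using this article's price figures.
OFFERS = [
    Offer("gpt-4.1-mini", 0.40, 1.60, 1_000_000),
    Offer("claude-haiku-4.5", 0.80, 4.00, 200_000),
    Offer("gemini-2.5-flash", 0.30, 2.50, 1_000_000),
    Offer("deepseek-v3", 0.14, 0.28, 128_000),
]


def cheapest_offer(offers, input_tokens, output_tokens, approved):
    """Return the lowest-cost approved offer whose context window fits the request."""
    def cost(offer):
        return (input_tokens * offer.input_per_mtok
                + output_tokens * offer.output_per_mtok) / 1_000_000

    candidates = [
        offer for offer in offers
        if offer.model in approved
        and offer.context_tokens >= input_tokens + output_tokens
    ]
    if not candidates:
        raise LookupError("no approved offer fits this request")
    return min(candidates, key=cost)


approved = {offer.model for offer in OFFERS}
print(cheapest_offer(OFFERS, 4_000, 500, approved).model)      # short prompt
print(cheapest_offer(OFFERS, 300_000, 2_000, approved).model)  # long prompt
```

The `approved` set is where quality enters: price alone picks the winner only among models you have already benchmarked for the task. Note how the long prompt changes the answer, because the cheapest offer no longer fits its context window.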

LLM API Pricing Comparison: Mid-2026

All prices are as of mid-2026. Verify current pricing on each provider's documentation before building cost projections into production systems.

| Provider | Model | Input ($/1M tokens) | Output ($/1M tokens) | Context | Self-hostable | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | 1M | No | General-purpose flagship |
| OpenAI | GPT-4.1 mini | $0.40 | $1.60 | 1M | No | Strong price-to-performance ratio |
| OpenAI | GPT-4.1 nano | $0.10 | $0.40 | 1M | No | Cheapest OpenAI option |
| OpenAI | o3 | $10.00 | $40.00 | 200k | No | Chain-of-thought reasoning, top tier |
| OpenAI | o4-mini | $1.10 | $4.40 | 200k | No | Efficient reasoning-class model |
| Anthropic | Claude Opus 4 | $15.00 | $75.00 | 200k | No | Frontier tier; deep reasoning tasks |
| Anthropic | Claude Sonnet 4.5 | $3.00 | $15.00 | 200k | No | Model name subject to Anthropic's final release naming; prompt caching reduces real cost |
| Anthropic | Claude Haiku 4.5 | $0.80 | $4.00 | 200k | No | High-volume efficient-tier option |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 | 1M | No | Aggressive pricing for frontier-class quality |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 1M | No | Long-context cost advantage |
| Deepseek | V3 | $0.14 | $0.28 | 128k | Weights available | Cache miss rates; cached input ~10x cheaper |
| Deepseek | R1 | $0.55 | $2.19 | 128k | Weights available | Reasoning-tier at fraction of o3 price |
| Meta / Together | Llama 3.x 70B | ~$0.18 | ~$0.18 | 128k | Yes | Rates vary by inference provider |
| Mistral | Large 2 | $3.00 | $9.00 | 128k | Partial | Open weights for smaller Mistral models |

What LLM API billing is still missing

The pricing data above solves for what a model costs per token. It does not solve for what a specific user or tenant costs you per month, and it does not block calls before they fire. Every provider in this table bills you in arrears: the API call succeeds, the tokens are consumed, and the charge accumulates. If a user triggers an unusually expensive chain, or if a bug causes a loop, the bill arrives after the damage.

Multi-tenant SaaS products face a tighter version of this problem. If you are building an application where end-customers each consume LLM capacity, you need per-user attribution to understand which customers are profitable. Standard provider billing gives you aggregate usage across your API key. Splitting that into per-user cost requires either building custom metering infrastructure or routing every call through a layer that tracks it. Neither is trivial to build correctly, particularly when your application uses multiple models or multiple providers.
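The attribution step itself is a small aggregation once every call is recorded with a tenant identifier. The sketch below assumes you already log one record per call with the fields shown; the record shape, tenant names, and model labels are hypothetical, and the prices are this article's figures.

```python
from collections import defaultdict

# USD per 1M tokens as (input, output), from this article's figures.
PRICES = {
    "gpt-4.1-mini": (0.40, 1.60),
    "claude-haiku-4.5": (0.80, 4.00),
}


def cost_by_tenant(usage_records):
    """Sum USD cost per tenant across models and providers."""
    totals = defaultdict(float)
    for rec in usage_records:
        price_in, price_out = PRICES[rec["model"]]
        totals[rec["tenant"]] += (
            rec["input_tokens"] * price_in + rec["output_tokens"] * price_out
        ) / 1_000_000
    return dict(totals)


# Hypothetical per-call usage log.
records = [
    {"tenant": "acme", "model": "gpt-4.1-mini",
     "input_tokens": 2_000_000, "output_tokens": 500_000},
    {"tenant": "acme", "model": "claude-haiku-4.5",
     "input_tokens": 1_000_000, "output_tokens": 250_000},
    {"tenant": "globex", "model": "gpt-4.1-mini",
     "input_tokens": 100_000, "output_tokens": 20_000},
]

for tenant, usd in sorted(cost_by_tenant(records).items()):
    print(f"{tenant}: ${usd:.3f}")
```

The arithmetic is the easy part. The work is in making sure every call path, across every model and provider, actually writes a record with the right tenant attached.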

The other gap is passthrough billing. If you want to charge customers for their LLM consumption (marking it up or passing it through at cost), you need a Stripe integration that maps usage to invoice line items. Building this alongside pre-flight enforcement and per-user attribution is the kind of infrastructure that takes weeks and is not your core product.
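As an illustration of the usage-to-invoice mapping, the sketch below turns a tenant's provider cost into a line item with a markup, using decimal arithmetic and integer cents to avoid float rounding on money. The dictionary shape and the `invoice_line` helper are hypothetical, not Stripe's schema or API; a real integration would map these values onto whatever your billing system expects.

```python
from decimal import Decimal, ROUND_HALF_UP


def invoice_line(tenant, provider_cost_usd, markup_pct):
    """Build a hypothetical invoice line item: provider cost plus a percentage markup."""
    cost = Decimal(str(provider_cost_usd))
    multiplier = Decimal(1) + Decimal(str(markup_pct)) / Decimal(100)
    # Round once, at the end, to whole cents.
    cents = int((cost * multiplier * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return {
        "tenant": tenant,
        "description": f"LLM usage (+{markup_pct}% markup)",
        "amount_cents": cents,
    }


print(invoice_line("acme", 3.40, 20))    # marked up
print(invoice_line("globex", 0.072, 0))  # passed through at cost
```

Rounding per line item rather than per call matters here: individual calls often cost a fraction of a cent, and rounding each one would drift away from what the provider actually charged.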

Frequently asked questions

What is the cheapest LLM API in 2026?

For general-purpose tasks, GPT-4.1 nano ($0.10/MTok input) and Gemini 2.5 Flash ($0.30/MTok input) are the lowest-cost options from major providers. For open-weight inference, Deepseek V3 via third-party hosts runs under $0.10/MTok. The cheapest option depends on your context window, latency, and quality requirements — see the comparison table above.

Why did token prices fall so quickly between 2023 and 2026?

Three factors drove simultaneous pressure: competitive dynamics between well-funded labs with different cost structures (Google in particular has infrastructure advantages that justify aggressive pricing), open-weight model releases that established a credible self-hosting floor, and hardware improvements that reduced per-FLOP inference costs. Each lab cut prices partly to remain competitive and partly because the underlying compute cost had dropped enough to preserve margins even at lower rates.

Are reasoning models like o3 and Deepseek R1 worth the per-token premium?

It depends on the task. For coding assistants, legal document analysis, and multi-step planning problems, reasoning models reduce the number of turns needed and catch errors that cheaper models miss, which can make the higher per-token cost net-cheaper at the task level. For classification, summarization, or extraction at scale, they are almost never worth it. o4-mini sits in a useful middle ground: reasoning-class quality at roughly half the per-token price of GPT-4.1 ($1.10/$4.40 versus $2/$8 per million tokens), though reasoning models typically generate more output tokens per answer. The practical approach is to route by task type and measure both quality and cost empirically rather than applying one model to everything.
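The "net-cheaper at the task level" claim comes down to a break-even turn count. The sketch below compares one o3 turn against repeated GPT-4.1 turns at the prices listed in this article. The token counts per turn are hypothetical, and the model assumes each cheaper turn costs the same, which understates multi-turn cost when conversation history grows.

```python
import math


def turn_cost(price_in, price_out, input_tokens, output_tokens):
    """USD cost of one turn, with prices in USD per 1M tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical task: 8k tokens of context per turn.
# GPT-4.1 at $2/$8 with a 1k-token answer per turn.
cheap_turn = turn_cost(2.00, 8.00, 8_000, 1_000)
# o3 at $10/$40 solving it in one turn; the 2k output tokens include reasoning tokens.
reasoning_task = turn_cost(10.00, 40.00, 8_000, 2_000)

# Smallest number of cheaper-model turns at which the reasoning model costs less.
break_even_turns = math.ceil(reasoning_task / cheap_turn)

print(f"GPT-4.1 per turn: ${cheap_turn:.3f}")
print(f"o3 single turn:   ${reasoning_task:.3f}")
print(f"o3 is cheaper once GPT-4.1 needs {break_even_turns}+ turns")
```

With these numbers the cheaper model has to need seven or more turns before o3 wins on cost alone, which is why the decision should rest on measured turn counts and error rates for your own tasks rather than on per-token prices.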

Has the price floor been reached, or will costs keep falling?

The commodity inference layer (open-weight models run on spot GPU capacity) is already priced close to hardware cost. Further reductions there will track GPU rental prices rather than lab pricing decisions. For frontier proprietary models, there is probably still room for another 40-60% reduction through 2027 as hardware efficiency improves and competition intensifies. The pricing gap between frontier and commodity tiers has historically closed as frontier capability migrates downward, and that pattern shows no sign of stopping.

Should I switch providers to reduce costs?

Switching for cost alone is rarely straightforward. Context window behavior, tool call formatting, and output consistency differ enough between providers that a swap requires re-testing your prompts and downstream parsing. The better approach is workload routing: keep your current provider for tasks where you've tuned prompts, and route new task types to cheaper models after benchmarking. LiteLLM and similar libraries make the mechanical part of this easier; the evaluation work is still on you.

How do I actually control LLM spend in a multi-tenant application?

Per-user cost attribution starts with tagging every API call with a user or tenant identifier, then aggregating by that tag. Most providers support metadata fields for this. The harder problem is enforcement: once you have per-user spend data, you need to act on it before the next call fires, not after the billing cycle closes. That requires a pre-flight check against a running spend total, which means your inference call path needs to consult a budget store before hitting the provider API. Building this correctly, with race-condition handling and consistent cost estimation across models, is not a small project.
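A minimal sketch of the pre-flight pattern, assuming a single process and an in-memory store: reserve the estimated cost under a lock before calling the provider, then settle to the actual cost afterwards. `BudgetStore`, `guarded_call`, and the `call_provider` callable (which here simply returns the actual cost in USD) are hypothetical names. A multi-instance deployment would need a shared store with atomic operations instead of a local lock.

```python
import threading


class BudgetExceeded(Exception):
    """Raised when a reservation would push a tenant past its limit."""


class BudgetStore:
    """In-memory per-tenant spend tracker (single-process sketch only)."""

    def __init__(self, limits_usd):
        self._limits = dict(limits_usd)
        self._spent = {tenant: 0.0 for tenant in limits_usd}
        self._lock = threading.Lock()

    def reserve(self, tenant, estimated_cost_usd):
        """Atomically check the limit and reserve the estimate."""
        with self._lock:
            if self._spent[tenant] + estimated_cost_usd > self._limits[tenant]:
                raise BudgetExceeded(tenant)
            self._spent[tenant] += estimated_cost_usd

    def settle(self, tenant, estimated_cost_usd, actual_cost_usd):
        """Replace a reservation with the actual cost once it is known."""
        with self._lock:
            self._spent[tenant] += actual_cost_usd - estimated_cost_usd

    def spent(self, tenant):
        with self._lock:
            return self._spent[tenant]


def guarded_call(store, tenant, estimated_cost_usd, call_provider):
    """Run call_provider only if the tenant's budget can cover the estimate."""
    store.reserve(tenant, estimated_cost_usd)  # raises before any tokens are consumed
    try:
        actual_cost_usd = call_provider()
    except Exception:
        store.settle(tenant, estimated_cost_usd, 0.0)  # release the reservation
        raise
    store.settle(tenant, estimated_cost_usd, actual_cost_usd)
    return actual_cost_usd


store = BudgetStore({"acme": 1.00})
guarded_call(store, "acme", 0.60, lambda: 0.50)
print(f"acme spent: ${store.spent('acme'):.2f}")
```

Reserving before the call, rather than only checking, is what closes the race: two concurrent requests cannot both pass the check against the same remaining budget. The estimate can still be wrong, so the settle step corrects the running total after each call.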

Conclusion

Token prices have compressed by roughly 10x at the top tier since 2023, and the commodity inference layer has pushed the floor even lower. What has changed is that the pricing decision is now a cost-engineering problem, not a capability problem — the right model at the right tier for the right workload is worth more than chasing the lowest headline rate across the board.

Teams that have solved for model choice now face uncontrolled spend as the primary LLM cost risk. Spend controls, per-user attribution, and pre-flight enforcement matter more than which row of the pricing table you pick — because every provider in this table bills in arrears, with no mechanism to stop consumption before it exceeds a budget.

noburn.dev sits in front of those provider calls, estimates token cost client-side, and blocks the request before the API call fires if the calling user or project has hit its limit. It wraps OpenAI, Anthropic, LiteLLM, LangChain, LangGraph, and the Vercel AI SDK without requiring a provider change. The free tier covers 50,000 requests per month. Documentation and SDKs are available at noburn.dev/docs.