The math is simple: if a user consumes 500,000 tokens per month and you pay $10 per million input tokens, that user costs you $5 in raw inference. At a $20/month plan price, gross margin on that customer is 75%. At 2,000,000 tokens the inference bill equals the plan price and margin on that customer is zero; anything beyond that and you are underwater on them. Traditional SaaS did not have this problem, because hosting cost is roughly fixed whether users click ten times or ten thousand. LLM inference bills per token, and 2026 has surfaced enough real P&L data to show what healthy and unhealthy patterns actually look like.
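The arithmetic above can be checked in a few lines of Python. This is a minimal sketch that assumes a single price per million tokens and ignores output-token pricing, hosting, and support costs; the function names are illustrative, not from any library.

```python
def inference_cost(tokens: int, price_per_million: float) -> float:
    """Raw inference cost in dollars for a monthly token volume."""
    return tokens / 1_000_000 * price_per_million


def gross_margin(plan_price: float, tokens: int, price_per_million: float) -> float:
    """Gross margin on inference alone, as a fraction of plan price."""
    return (plan_price - inference_cost(tokens, price_per_million)) / plan_price


def break_even_tokens(plan_price: float, price_per_million: float) -> int:
    """Monthly token volume at which inference cost equals plan price."""
    return round(plan_price / price_per_million * 1_000_000)


print(gross_margin(20.0, 500_000, 10.0))    # 0.75
print(gross_margin(20.0, 2_000_000, 10.0))  # 0.0
print(break_even_tokens(20.0, 10.0))        # 2000000
```

Any volume above the break-even figure yields a negative margin from `gross_margin`.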
The tooling around AI cost management has split into four distinct categories: cost observability (see what happened), model routing (use cheaper models intelligently), semantic caching (avoid redundant calls), and pre-flight enforcement (block calls before they fire). Each addresses a different failure mode in the unit economics, and profitable AI SaaS companies end up using two or three together.
Key numbers at a glance
- Healthy gross margin for AI SaaS is 55–70% at early stage, with 70%+ as the target once routing and caching optimizations compound. Traditional SaaS benchmarks are 70–85%; inference costs compress that range.
- LLM cost as a share of revenue should stay below 15–20%. At $50/month pricing that implies a per-user inference budget of roughly $7.50–10 per month.
- Break-even token volume at $10/million input tokens on a $20/month plan is about 2,000,000 tokens per user per month. Beyond that the customer is unprofitable on inference alone.
- Semantic caching cuts repeat inference spend by 25–45% for products with predictable query distributions such as FAQ bots and customer support agents.
- Model routing to smaller models (10–20x cheaper on input tokens) can recover 10–40 gross-margin points if 60%+ of workload tolerates a smaller model.
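The per-user budget in the list above follows directly from the revenue-share target. The sketch below turns a plan price and a target cost share into a dollar budget and a token budget; the $10 per million token price is carried over from the opening example as an assumption, and the function names are illustrative.

```python
def inference_budget(plan_price: float, max_cost_share: float) -> float:
    """Monthly per-user inference budget in dollars."""
    return plan_price * max_cost_share


def token_budget(plan_price: float, max_cost_share: float, price_per_million: float) -> int:
    """Monthly per-user token allowance implied by the dollar budget."""
    dollars = inference_budget(plan_price, max_cost_share)
    return round(dollars / price_per_million * 1_000_000)


for share in (0.15, 0.20):
    dollars = inference_budget(50.0, share)
    tokens = token_budget(50.0, share, 10.0)
    print(f"{share:.0%} of $50 -> ${dollars:.2f}/month, about {tokens:,} tokens")
```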
Cost observability: Helicone, LangSmith, and Portkey
These tools sit in the request path or are called after the fact to log token counts, latency, and estimated cost to a dashboard. Helicone runs as a proxy between your app and the OpenAI or Anthropic API, capturing every request and annotating it with cost estimates based on the model and token counts. LangSmith integrates directly with LangChain and logs full traces, showing not just top-level cost but how much each chain step, tool call, and retrieval operation contributes. Portkey combines the proxy model with a gateway that supports model fallbacks and retries across providers.
The core value here is visibility. A common discovery is that one feature, usually anything with long context or agentic loops, accounts for 60-70% of total inference spend while serving 20% of users. You cannot make pricing decisions without that breakdown, and most teams flying blind on per-user costs during their first year end up re-pricing at a loss once usage data actually arrives.
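A per-feature breakdown of this kind can be approximated from logged requests. The following is a minimal sketch, not the output format of Helicone, LangSmith, or Portkey: the record fields, model tiers, and prices in `PRICES` are all hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical dollars per million tokens; real values come from your provider's price list.
PRICES = {
    "frontier": {"input": 10.0, "output": 30.0},
    "small": {"input": 0.5, "output": 1.5},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one request from its token counts."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def spend_share_by_feature(records: list[dict]) -> dict[str, float]:
    """Fraction of total estimated spend attributable to each feature."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r["feature"]] += request_cost(r["model"], r["input_tokens"], r["output_tokens"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {feature: 0.0 for feature in totals}
    return {feature: cost / grand_total for feature, cost in totals.items()}


records = [
    {"feature": "chat", "model": "small", "input_tokens": 1_000, "output_tokens": 500},
    {"feature": "agent", "model": "frontier", "input_tokens": 50_000, "output_tokens": 5_000},
]
for feature, share in spend_share_by_feature(records).items():
    print(f"{feature}: {share:.1%} of spend")
```

Grouping the same records by customer instead of feature gives the per-user view that pricing decisions depend on.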
The limitation is that observability is retrospective. You learn how much something cost after it happened. For unit economics, this is necessary but not sufficient: you still need to handle what happens when a user or project exceeds its budget. Observability tools will show you the overage on a dashboard. They will not prevent the bill from accumulating first.
Model routing and tiering: LiteLLM and OpenRouter
LiteLLM is an open-source proxy that translates between different LLM provider APIs and supports routing logic, load balancing, and fallbacks. You can configure it to route simple classification tasks to a cheaper model (GPT-4o-mini, Haiku) while sending complex reasoning to a frontier model, all behind a single OpenAI-compatible endpoint. OpenRouter is a hosted version of a similar concept with a marketplace of models and automatic least-cost routing.
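The routing decision itself can be illustrated without any proxy. This sketch is plain Python, not LiteLLM or OpenRouter configuration; the task labels, model names, and the 8,000-token context cutoff are assumptions chosen for illustration.

```python
CHEAP_MODEL = "small-model"        # placeholder name, not a real model ID
FRONTIER_MODEL = "frontier-model"  # placeholder name, not a real model ID

# Task types assumed to tolerate a smaller model; calibrate this per use case.
SIMPLE_TASKS = {"classification", "extraction", "short_summary"}


def choose_model(task_type: str, prompt_tokens: int, max_cheap_context: int = 8_000) -> str:
    """Send simple, short requests to the cheap model and everything else to the frontier model."""
    if task_type in SIMPLE_TASKS and prompt_tokens <= max_cheap_context:
        return CHEAP_MODEL
    return FRONTIER_MODEL


print(choose_model("classification", 500))        # small-model
print(choose_model("multi_step_reasoning", 500))  # frontier-model
print(choose_model("classification", 20_000))     # frontier-model
```

Defaulting unknown task types to the frontier model trades some cost for safety: a misrouted request costs more money rather than producing a degraded answer.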
The unit economics impact is concrete. As of late 2025, the price gap between frontier models and capable smaller models was roughly 10-20x on input tokens. If 60% of your workload can tolerate a smaller model, that compression flows directly to gross margin. Teams running agentic workflows with many tool-call loops report the largest savings here, since each loop step at frontier pricing adds up fast across thousands of daily users.
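To see how that compression reaches gross margin, the sketch below computes the blended cost after routing. It assumes the routed share is measured in tokens, that both models use the same number of tokens per task, and that inference starts at 30% of revenue; that last figure is a hypothetical input, not a number from the text.

```python
def blended_cost_factor(cheap_share: float, price_ratio: float) -> float:
    """Blended inference cost as a fraction of the all-frontier cost.

    cheap_share: fraction of token volume routed to the smaller model.
    price_ratio: how many times cheaper the smaller model is (e.g. 10 or 20).
    """
    return (1 - cheap_share) + cheap_share / price_ratio


def margin_points_recovered(cost_share_of_revenue: float, cheap_share: float, price_ratio: float) -> float:
    """Gross-margin points gained when inference cost shrinks by the blended factor."""
    saved_fraction = 1 - blended_cost_factor(cheap_share, price_ratio)
    return cost_share_of_revenue * saved_fraction * 100


print(f"{blended_cost_factor(0.6, 10):.2f}")            # 0.46
print(f"{blended_cost_factor(0.6, 20):.2f}")            # 0.43
print(f"{margin_points_recovered(0.30, 0.6, 10):.1f}")  # 16.2
```

With 60% of volume routed, most of the remaining cost is the 40% still served at frontier prices, so moving from a 10x to a 20x price gap changes the result only slightly.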
The challenge is that routing quality matters. A misconfigured router that sends complex tasks to a cheap model degrades output in ways that cost you customers rather than money. LiteLLM's routing requires manual calibration per use case. OpenRouter's automatic routing is convenient but opaque about how it chooses on your behalf.
Semantic caching: skipping calls you have already answered
Semantic caching intercepts incoming requests, embeds the prompt, and checks whether a semantically similar query has already been answered recently. If it has, the cached response is returned without hitting the inference API. Helicone has a built-in cache. GPTCache is a standalone open-source library with pluggable similarity thresholds. Redis with a vector extension is common for teams building their own.
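The lookup flow described above can be sketched as follows. To stay self-contained, this example uses a toy bag-of-words vector and a linear scan in place of a real embedding model and vector index; it is not the API of Helicone, GPTCache, or Redis, and the 0.8 threshold is an arbitrary assumption.

```python
import math
from collections import Counter
from typing import Callable, Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, prompt: str) -> Optional[str]:
        """Return the most similar cached response if it clears the threshold."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))


def answer(prompt: str, cache: SemanticCache, call_model: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise call the model and store the result."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached
    response = call_model(prompt)
    cache.put(prompt, response)
    return response
```

Passing `call_model` as a parameter keeps the cache independent of any particular provider client, and `threshold` is the knob that trades missed hits against wrong answers.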
Effective hit rate depends heavily on product category. Based on production data across teams using semantic caching in early 2026, FAQ bots, customer support agents, and Q&A products with predictable query distributions see 25–45% cache hit rates. Coding assistants, where nearly every query is unique, see near-zero. For products where caching applies, the economics can be dramatic: a cache lookup costs a small fraction of the inference call it replaces. At $0.01 per query inference cost, a 35% hit rate on 100,000 monthly queries saves $350 per month.
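The savings figure is simple to reproduce. This sketch adds one parameter the text does not quantify, `lookup_cost_per_query`, as an assumption: every query pays for an embedding and a cache lookup whether it hits or not, so that cost is subtracted from the gross savings.

```python
def monthly_cache_savings(
    queries: int,
    hit_rate: float,
    inference_cost_per_query: float,
    lookup_cost_per_query: float = 0.0,  # assumed overhead paid on every query
) -> float:
    """Net monthly savings from serving cache hits instead of calling the model."""
    avoided_inference = queries * hit_rate * inference_cost_per_query
    lookup_overhead = queries * lookup_cost_per_query
    return avoided_inference - lookup_overhead


print(f"{monthly_cache_savings(100_000, 0.35, 0.01):.2f}")          # 350.00
print(f"{monthly_cache_savings(100_000, 0.35, 0.01, 0.0001):.2f}")  # 340.00
```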
Semantic caching introduces complexity. Staleness is a real problem for time-sensitive queries. Similarity thresholds require tuning: too strict and you miss valid cache hits, too loose and you return wrong answers to slightly different questions. Teams usually end up with per-feature cache configurations and ongoing threshold maintenance that consumes real engineering hours.
Budget enforcement: pre-flight vs post-call
The observability and caching tools assume you will examine usage data and make decisions afterward. Budget enforcement is different: it sits before the API call and either allows or blocks it based on whether the user or project has remaining budget.
The naive implementation is post-call accounting: count tokens after each response, deduct from a balance, and block future calls once the balance hits zero. The problem is that you can still overshoot because the call has already fired when you discover the overage. For agentic workflows that trigger at arbitrary times via webhooks or background jobs, a user at zero balance can still kick off a call before your accounting layer catches up.
Pre-flight enforcement estimates the token cost of the pending request before sending it to the model. If the estimated cost would exceed the remaining balance, the call is blocked at the SDK level and never leaves the client. This requires the client SDK to know token counts and model pricing, which is why it is implemented as an SDK wrapper rather than a server-side proxy.
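A minimal sketch of that pre-flight check follows. The four-characters-per-token heuristic, the prices, and the names `guarded_call` and `BudgetExceeded` are assumptions for illustration; a production wrapper would use the provider's tokenizer and current price list.

```python
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when the estimated cost of a call exceeds the remaining balance."""


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); a real tokenizer is more accurate."""
    return max(1, len(text) // 4)


def estimate_cost(prompt: str, max_output_tokens: int,
                  input_price_per_million: float, output_price_per_million: float) -> float:
    """Worst-case cost: estimated input tokens plus the full output-token cap."""
    input_cost = estimate_tokens(prompt) * input_price_per_million
    output_cost = max_output_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000


def guarded_call(prompt: str, balance: float, call_model: Callable[[str], str], *,
                 max_output_tokens: int = 500,
                 input_price_per_million: float = 10.0,    # hypothetical price
                 output_price_per_million: float = 30.0):  # hypothetical price
    """Block the call before it is sent if the estimate exceeds the balance."""
    estimated = estimate_cost(prompt, max_output_tokens,
                              input_price_per_million, output_price_per_million)
    if estimated > balance:
        raise BudgetExceeded(f"estimated ${estimated:.4f} exceeds balance ${balance:.4f}")
    return call_model(prompt)
```

Budgeting against the output-token cap rather than an expected length is deliberately conservative: it may block calls that would have fit, but it does not let a long response push the balance negative.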
AI SaaS tool comparison
| Tool | Category | Enforcement timing | Per-user metering | Self-host | Pricing |
|---|---|---|---|---|---|
| Helicone | Observability + caching | Post-call logging | Project-level only | Yes (open source) | Free (10k req/mo); Pro $50/mo; Team $200/mo |
| LangSmith | Observability + tracing | Post-call logging | Workspace-level | Enterprise | Developer free (5k traces/mo); Plus $39/mo; Enterprise custom |
| Portkey | Gateway + observability | Post-call logging | Virtual keys | Yes | Free (10k req/mo); Growth $49/mo; Enterprise custom |
| LiteLLM | Model routing + proxy | Pre-call routing; post-call spend tracking | Basic budgets via proxy config | Yes (open source) | Open source; enterprise pricing varies |
| OpenRouter | Hosted model routing | Post-call | Per API key | No | Per-token usage; no flat-fee tier |
| GPTCache | Semantic caching | Pre-call (cache check only) | No | Yes (open source) | Open source |
LLM cost management gaps: per-customer P&L, feature-level billing, and mid-job enforcement
The observability tools have matured significantly, but three gaps remain that no single tool covers well.
Real-time per-customer P&L. You can see aggregate inference spend in Helicone or LangSmith. You cannot easily see, in real time, which of your 500 customers is currently profitable at the per-request level and which is underwater. This requires tying inference cost to subscription revenue per customer, which none of the observability tools do natively. Most teams build this in their own data warehouse with a weekly job that joins usage exports to Stripe data.
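The warehouse join described here reduces to a small aggregation. This sketch uses in-memory dictionaries in place of usage exports and Stripe data; the field names `customer_id` and `cost_usd` are hypothetical, not a real export schema.

```python
from collections import defaultdict


def customer_pnl(usage_rows: list[dict], revenue_by_customer: dict[str, float]) -> dict[str, dict]:
    """Join per-request inference cost to subscription revenue for each customer."""
    cost_by_customer: dict[str, float] = defaultdict(float)
    for row in usage_rows:
        cost_by_customer[row["customer_id"]] += row["cost_usd"]

    report = {}
    for customer_id, revenue in revenue_by_customer.items():
        cost = cost_by_customer.get(customer_id, 0.0)
        report[customer_id] = {
            "revenue": revenue,
            "inference_cost": cost,
            "margin": (revenue - cost) / revenue if revenue else None,
            "underwater": cost > revenue,
        }
    return report


usage = [
    {"customer_id": "a", "cost_usd": 3.0},
    {"customer_id": "a", "cost_usd": 2.0},
    {"customer_id": "b", "cost_usd": 25.0},
]
revenue = {"a": 20.0, "b": 20.0, "c": 20.0}
for customer_id, row in customer_pnl(usage, revenue).items():
    print(customer_id, row)
```

Running the same join on a streaming source rather than a weekly batch is what would make the view real-time.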
Passthrough billing at the feature level. Several platforms support charging end-users for their LLM usage via Stripe. What is missing is feature-level granularity: charging differently for a cheap summarization feature versus an expensive agentic research workflow run by the same user in the same product. The billing integrations that exist operate at the user level, not the feature level.
Enforcement within running agentic jobs. When a user triggers an agent that runs asynchronously over the next 20 minutes, budget enforcement gets complicated. The user may have had budget when the job was enqueued but exhaust it before the job completes. Tools that enforce at queue time are different from tools that enforce at each individual API call within a running job. This mid-job enforcement gap is largely unsolved.
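The difference between queue-time and per-call enforcement can be shown with a loop that re-reads the balance before every step. This is a sketch under assumptions: `remaining_budget`, `estimate_step_cost`, and `run_step` are hypothetical callables supplied by the caller, and a real system would read the balance from a shared store that other jobs also update.

```python
from typing import Callable, Sequence


def run_agent_job(
    steps: Sequence[str],
    remaining_budget: Callable[[], float],       # re-read before each step, not once at enqueue
    estimate_step_cost: Callable[[str], float],
    run_step: Callable[[str], None],
) -> tuple[str, list[str]]:
    """Run steps in order, pausing as soon as the next step would exceed the budget."""
    completed: list[str] = []
    for step in steps:
        if estimate_step_cost(step) > remaining_budget():
            return "paused_over_budget", completed
        run_step(step)
        completed.append(step)
    return "completed", completed


# Demo: a wallet with $0.25 and steps that each cost $0.10.
wallet = {"balance": 0.25}

def spend(step: str) -> None:
    wallet["balance"] -= 0.10

status, done = run_agent_job(
    ["plan", "search", "read", "summarize"],
    remaining_budget=lambda: wallet["balance"],
    estimate_step_cost=lambda step: 0.10,
    run_step=spend,
)
print(status, done)  # paused_over_budget ['plan', 'search']
```

Returning the completed steps lets the job resume after a top-up instead of restarting from scratch.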
FAQ
At what point should I switch from post-call accounting to pre-flight enforcement?
Post-call accounting is fine when your product's LLM calls are synchronous and user-initiated. It breaks down once you have agentic workflows, background jobs, or webhook-triggered calls where a user can kick off a request before your accounting layer has caught up with their current balance. If any part of your product can fire LLM calls outside of a direct user action in the request path, pre-flight enforcement is worth adding before you scale, not after an overage incident forces the issue.
How does per-user metering differ from project-level budget controls?
Project-level controls cap total spend across all users of a given integration key or workspace. Per-user metering assigns a separate budget to each end-customer and blocks or bills them individually when they exceed it. For multi-tenant SaaS, only per-user metering lets you enforce the unit economics on a per-customer basis — project-level caps just protect your aggregate bill without telling you which customers are profitable and which are not.
Should I charge per-seat or per-usage?
For AI-heavy products, pure per-seat pricing transfers the token cost risk entirely to you. A power user on a $50/month flat plan who generates 5x average inference volume erodes margin on that customer with no recourse. Usage-based or hybrid pricing (a seat fee plus per-token overage) passes some of that risk back to the customer. The practical answer for most early-stage products is to start with per-seat, collect real usage data, identify your power user distribution, and switch to hybrid before you scale past 200-300 paying customers.
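A hybrid invoice is a seat fee plus metered overage. The numbers in this sketch (a 2,000,000-token allowance and a $15 per million overage price) are hypothetical and only illustrate the mechanics.

```python
def hybrid_invoice(seat_fee: float, included_tokens: int, tokens_used: int,
                   overage_price_per_million: float) -> float:
    """Monthly charge: flat seat fee plus per-token overage beyond the included allowance."""
    overage_tokens = max(0, tokens_used - included_tokens)
    return seat_fee + overage_tokens / 1_000_000 * overage_price_per_million


print(hybrid_invoice(50.0, 2_000_000, 1_000_000, 15.0))  # 50.0
print(hybrid_invoice(50.0, 2_000_000, 5_000_000, 15.0))  # 95.0
```

As long as the overage price exceeds the provider's per-token price, a power user's extra volume adds margin instead of eroding it.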
What is the most common unit economics mistake early AI SaaS companies make?
Treating LLM costs as a rounding error during development and only measuring them seriously at scale. This leads to pricing decisions made without real cost data, which tend to look fine at 50 customers and break silently at 500. The second most common mistake is not segmenting usage data by customer: your average inference cost hides a distribution where a small percentage of customers account for a disproportionate share of spend, and knowing who those customers are before you price is the difference between sustainable unit economics and a pricing structure that punishes growth.
How does Stripe passthrough billing actually work in practice?
The product charges the end-user's payment method for their LLM usage, typically denominated in credits or tokens, rather than the SaaS company absorbing that cost as COGS. The practical implementation usually involves a credit wallet per user, a webhook from the inference layer that decrements the wallet after each call, and a Stripe charge triggered when the wallet runs low. The accounting treatment matters: passthrough billing offsets variable COGS with matching usage revenue, which improves gross margin on paper, but only if the per-token price you charge customers exceeds what you pay the model provider plus infrastructure overhead.
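The wallet mechanics can be sketched without touching a payment API. In this example the 1.3x markup, the $5 low-water mark, and the $20 top-up are hypothetical values; the Stripe charge is left out, and `record_usage` only reports when one should be triggered.

```python
from dataclasses import dataclass


@dataclass
class CreditWallet:
    balance: float                # dollars of prepaid credit
    low_watermark: float = 5.0    # assumed threshold for triggering a top-up
    top_up_amount: float = 20.0   # assumed size of each top-up charge

    def record_usage(self, provider_cost: float, markup: float = 1.3) -> bool:
        """Deduct the marked-up price of one call; return True if a top-up is due."""
        self.balance -= provider_cost * markup
        return self.balance < self.low_watermark

    def apply_top_up(self) -> None:
        """Credit the wallet after the payment provider confirms the charge."""
        self.balance += self.top_up_amount


wallet = CreditWallet(balance=6.0)
if wallet.record_usage(provider_cost=1.0):
    # In production, this is where the payment charge would be created and confirmed.
    wallet.apply_top_up()
print(f"{wallet.balance:.2f}")  # 24.70
```

A markup above 1.0 is what makes the passthrough margin-positive: the customer is charged more per call than the provider cost deducted from COGS.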
Conclusion
Healthy AI SaaS companies target 70%+ gross margin by treating the 55–70% early-stage range as a floor, not a ceiling, and compounding routing and caching optimizations until they get there. The lever that keeps gross margin in that range is controlling LLM cost as a share of revenue: that number needs to stay below 15–20%, which at typical plan prices translates to a firm per-user inference budget. Observability tells you when those budgets are being breached; enforcement is what actually stops the breach, and it has to happen before the API call fires rather than after the overage has already landed on your bill.
For teams that want pre-flight enforcement without building the metering layer, noburn.dev wraps your existing OpenAI, Anthropic, LangChain, or LangGraph client and blocks over-budget calls before the request leaves your server. It also includes a Stripe passthrough billing integration for charging end-customers at the feature level.