noburn.dev
← BlogJoin waitlist
open sourcellm gatewayself-hostedlitellm

Open Source LLM Gateways in 2026: What to Self-Host and What to Buy

LiteLLM, Ollama, and a half-dozen others offer self-hosted LLM routing. The operational cost of running your own gateway is higher than the license cost of a managed tool. Here is how to decide.

nb
noburn.dev·2026-06-14

An LLM gateway sits between your application and one or more model providers. It normalizes the request format so your code calls a single endpoint regardless of whether the model is GPT-4o, Claude, or a local Llama, and it handles the operational concerns that show up once you run more than one model in production: retries, fallback routing, key management, rate limiting, caching, and usage logging.

The category fragmented because those concerns pull in different directions. A proxy that routes 100 models well is not the same tool as a runtime that serves a quantized model on your own GPU, and neither one is built to enforce a per-customer spending limit. So the ecosystem split into routing proxies (LiteLLM, OpenRouter, Kong), local inference runtimes (Ollama, vLLM), and a managed layer that bundles observability and governance (Portkey, Helicone). Most teams end up running two or three of these together. This is a map of what each one actually does and where the line between self-hosting and buying falls in 2026.

LiteLLM

LiteLLM is the de facto open-source routing proxy. It exposes an OpenAI-compatible endpoint and translates calls to 100-plus providers, so application code written against the OpenAI SDK can target Anthropic, Bedrock, Vertex, or a local model by changing a config string. It runs as a Python library you import directly or as a standalone proxy server you deploy with a YAML config defining model groups, fallbacks, and budgets.

The proxy is where the operational weight lives. Running it in production means a Postgres database for keys and spend tracking, a Redis instance for distributed rate limiting and caching, and a deployment you patch and monitor. LiteLLM ships budget controls and virtual keys, but the spend tracking is computed after each call completes, so it tells you what was spent rather than blocking a call before it fires. For routing breadth and provider coverage, nothing open source comes close.

LiteLLM is MIT-licensed and free to self-host. There is also a paid enterprise tier (LiteLLM Enterprise) that adds SSO, audit logs, and support; check current pricing, since the enterprise tiers change. The decision point is staffing: the software costs nothing, but someone owns the Postgres, the Redis, the upgrades, and the on-call.

Ollama

Ollama is a local inference runtime, not a routing proxy, and it gets grouped into the gateway conversation because it also exposes an OpenAI-compatible API. It pulls quantized open-weight models (Llama, Mistral, Qwen, Gemma, and others) and serves them on your own hardware, CPU or GPU, with a single command. For local development, air-gapped environments, and workloads where data cannot leave your infrastructure, it is the simplest path to a running model.

What Ollama does not do is multi-provider routing, spend governance, or per-user metering. It serves the model in front of it. Teams pair it with a routing layer when they want one endpoint that falls back from a hosted frontier model to a local one, with LiteLLM or a similar proxy sitting in front. Treat Ollama as the inference engine in the stack, not the gateway.

It is free and open source. The real cost is hardware and the engineering time to size, quantize, and keep a model serving at acceptable latency under load, which is a meaningfully different skill set from running a proxy.

OpenRouter

OpenRouter is a hosted gateway, not a self-hosted one, but it belongs in any 2026 comparison because it solves the same routing problem without the operational burden. A single API key and an OpenAI-compatible endpoint give you access to hundreds of models across providers, with automatic fallback and a unified billing relationship so you are not managing separate accounts with each vendor.

The trade-off is that it is a closed, hosted service and a margin sits on top of provider pricing. You give up the control and data-residency guarantees of self-hosting in exchange for zero infrastructure. For teams that want breadth without running anything, it is the lowest-effort option. For teams with strict data governance or high enough volume that the per-token margin matters, the math pushes back toward self-hosting LiteLLM.

OpenRouter charges provider pass-through pricing plus a fee on credit purchases; check current pricing for the exact percentage, as it has changed over time.

Portkey

Portkey is a gateway plus an observability and governance layer. The gateway core is open source and self-hostable, and the managed product adds dashboards, logging, guardrails, prompt management, and budget controls on top. It targets the team that wants more than raw routing: tracing, analytics, and policy enforcement in one place.

Its budget and rate-limit features are more developed than a bare proxy, but the enforcement model is still built around observing and capping usage as calls flow through, with limits applied at the gateway. That is useful for org-level governance. It is a different thing from estimating a single request's cost before it executes and rejecting it on the spot.

Portkey has a free tier and paid plans; check current pricing. The self-hosted gateway is open source, while the full observability platform is the commercial product, so the build-versus-buy line runs through the middle of the product itself.

Helicone

Helicone is primarily an observability platform that operates as a proxy. You route calls through it and it logs requests, latency, token counts, and costs, with a dashboard for analysis and a caching layer. It is open source and self-hostable, and it is one of the easiest ways to get visibility into LLM spend after the fact.

Because it sits in the request path it can do gateway-adjacent things like caching and rate limiting, but its center of gravity is logging and analytics, not multi-provider routing or hard spend enforcement. Like the others in this group, cost data arrives once the call has completed. It answers "what did we spend and where" rather than "should this specific call be allowed to fire."

Helicone offers a free tier and usage-based paid plans, plus a self-hosted open-source option; check current pricing.

The comparison

ToolOpen sourceSelf-hostPrimary rolePre-flight enforcementPer-user meteringPricing
LiteLLMYes (MIT)YesRouting proxyNo — post-call budget trackingVirtual keys, post-callFree OSS; enterprise tier (check current pricing)
OllamaYesYesLocal inference runtimeNoNoFree; hardware cost only
OpenRouterNoNo (hosted)Hosted routingNoNoPass-through + fee (check current pricing)
PortkeyGateway core: yesGateway: yesGateway + observabilityNo — gateway-level capsOrg/key limitsFree tier + paid (check current pricing)
HeliconeYesYesObservability proxyNoNoFree tier + usage (check current pricing)
noburn.devSDK integrationsManagedPre-flight cost enforcementYes — blocks before the call firesYes — per end-customer limitsFree 100 req/mo; Early Bird $9/mo; Pro $49/mo

What the category is still missing

Every tool above measures cost after the API call returns. The proxy forwards the request, the provider runs the tokens, the response comes back, and only then does the gateway record what it cost and decide whether you have now crossed a budget. By the time a limit "triggers," the money is already spent. For an internal team watching an aggregate monthly number that lag is fine. For a multi-tenant SaaS where one customer can loop a prompt and burn through a month of margin in an afternoon, after-the-fact accounting is a report of damage already done.

The missing piece is pre-flight enforcement: estimating the token cost of a request before it executes and rejecting it when the user or project is already over budget, so the expensive call never reaches the provider. None of the routing proxies or observability tools are built around this, because their architecture assumes the call has to fire before its cost is known. The related gap is per-end-customer accounting. Virtual keys and gateway-level rate limits cap an API key or an organization, not an individual customer inside your application, and they do not feed clean per-customer usage into billing. If you charge customers for their own LLM usage, you need a hard limit and a metered number attached to each one, enforced before spend happens rather than reconciled after.

FAQ

Should I self-host LiteLLM or use a managed gateway?

Self-host LiteLLM when you need maximum provider coverage, full control over data residency, and you have engineers who can own a Postgres and Redis deployment plus upgrades and on-call. Buy a managed option when the staffing cost of running that infrastructure exceeds the license cost of a hosted tool, which for small teams it usually does once you account for the time spent maintaining it rather than building product.

Is Ollama an LLM gateway?

Not in the routing sense. Ollama is a local inference runtime that serves open-weight models on your own hardware and exposes an OpenAI-compatible API. It does not route across multiple providers or enforce spend, so teams typically run it behind a routing proxy like LiteLLM when they want one endpoint and local fallback.

Can LiteLLM block a request before it goes over budget?

Not before the call fires. LiteLLM tracks spend per virtual key and can stop further calls once a budget is exceeded, but the cost is computed after each request completes, so the call that crosses the threshold has already run and already cost money. Pre-flight enforcement that estimates cost and rejects the request before it executes requires a different layer.

What does it actually cost to run an open-source gateway?

The software is free, but a production deployment of a proxy like LiteLLM means a managed Postgres, a Redis instance, monitoring, and the engineering time to patch and operate it. At low volume that operational overhead routinely exceeds the monthly cost of a managed tool. Run the comparison on total cost including staff time, not license price.

How do I enforce per-customer spending limits in a multi-tenant app?

Gateway virtual keys and rate limits cap an API key or an organization, not an individual end-customer, and most enforce after the call. For real per-tenant control you need each customer mapped to a budget that is checked before the request fires, plus a metered usage number per customer you can pass into billing. That is the specific gap noburn fills.

The enforcement gap noburn fills

The tools in this article route, serve, and observe, and every one of them learns what a call cost after the provider has already run it. noburn works at a different point in the request lifecycle: it estimates the token cost client-side before the API call fires and blocks the request when the user or project is over budget, so the call that would blow the budget never reaches the provider. It plugs into the same stack you already run, with SDK integrations for OpenAI, Anthropic, LiteLLM, LangChain, LangGraph, and the Vercel AI SDK, so adding pre-flight enforcement does not mean replacing your gateway. For multi-tenant SaaS it adds per-user metering that enforces a separate limit for each end-customer and Stripe passthrough billing so you can charge customers for their own LLM usage from the same metered numbers. The free tier covers 100 requests per month. Documentation and SDKs are at noburn.dev/docs.