noburn.dev

OpenAI Assistants API vs Building Your Own Agent: True Cost at Scale

The Assistants API abstracts a lot of complexity. That abstraction has a per-token cost floor you cannot optimize away. Here is what the real numbers look like at 10k, 100k, and 1M calls per month.

noburn.dev · 2026-06-21

Introduction

The Assistants API promises a shortcut: thread management, file handling, code execution, and retrieval, all without building them yourself. But that convenience comes with a cost structure that becomes impossible to ignore at scale. Token spend and file storage both grow with usage patterns you do not directly control, and the abstraction locks you into OpenAI's infrastructure without meaningful cost controls.

This post maps the real numbers. We'll estimate the cost of the same agent workload under both approaches at 10k, 100k, and 1M calls per month, then show what you're actually paying for and what neither approach gives you out of the box.

OpenAI Assistants API

The Assistants API provides managed agents with built-in file storage, vector retrieval (via file search), and code execution. You create an assistant once, then send messages to it across multiple threads.

Pricing model: You pay per token (at the same rates as the Chat Completions API) plus a per-gigabyte-day charge for file and vector store storage. There is no per-request fee for the API itself. However, the abstraction adds token usage you never explicitly request: retrieval happens inside the assistant, so you don't control whether it runs or how many tokens it consumes on each turn.

Key limitations:

  • File search costs scale with the number of internal retrievals, not the number of requests. A single user message can trigger several of them.
  • Function calling syntax requires you to stay within OpenAI's tool format; integrating external APIs means more orchestration overhead.
  • No pre-flight cost enforcement. If a user's message triggers expensive retrieval or multiple tool calls, you discover the cost when the bill arrives.
  • Token usage is opaque within the assistant execution context, making it hard to set budgets per user or project.
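
The pricing model above reduces to two terms: token spend, and storage billed per gigabyte-day. Here is a minimal sketch of that arithmetic. The function name and every rate and volume passed to it are placeholder assumptions for illustration, not published OpenAI prices or figures from this post.

```python
def assistants_monthly_cost(
    calls: int,
    tokens_per_call: int,
    usd_per_million_tokens: float,
    storage_gb: float,
    usd_per_gb_day: float,
    days: int = 30,
) -> dict[str, float]:
    """Split a monthly estimate into its token and storage components."""
    token_cost = calls * tokens_per_call / 1_000_000 * usd_per_million_tokens
    storage_cost = storage_gb * usd_per_gb_day * days
    return {
        "tokens": round(token_cost, 2),
        "storage": round(storage_cost, 2),
        "total": round(token_cost + storage_cost, 2),
    }


# Placeholder inputs: 10k calls, 2,000 tokens each, an assumed blended
# $6 per 1M tokens, and 5 GB stored at an assumed $0.10 per GB-day.
estimate = assistants_monthly_cost(10_000, 2_000, 6.0, 5.0, 0.10)
print(estimate)
```

The token term is the one you cannot observe in advance: internal retrievals raise the effective `tokens_per_call` without any change in your own request volume.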

Building your own agent

A DIY agent means you own the orchestration layer: you call the model directly, manage tool execution, handle retrieval yourself, and control every token that flows. You build it with libraries like LangChain or LangGraph, or with raw OpenAI SDK calls.

Pricing model: You pay only for tokens used—no service markup, no storage fees unless you build your own vector database. Costs are predictable and linear with token count, but the operational cost is in engineering: prompt engineering, orchestration logic, error handling, and observability.

Key limitations:

  • You must build and maintain retrieval logic if your agent needs to search documents or knowledge bases.
  • Tool-calling orchestration is your responsibility. Handling fallbacks, retries, and tool-call failures requires custom logic.
  • You'll likely overshoot on token usage early on. A naive agent implementation can cost 2–3x more than an optimized one because you don't yet know what to optimize.
  • No budget enforcement at the API level. Cost control requires you to implement it yourself or layer in a third-party solution.
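
To give a sense of what "custom logic" for tool-call failures looks like, here is a minimal sketch of a retry wrapper with exponential backoff and a fallback. The helper name `call_with_retries`, the example tool, and its return strings are hypothetical and do not come from any particular framework.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    tool: Callable[[], T],
    fallback: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
) -> T:
    """Run `tool`, retrying with exponential backoff, then fall back."""
    for attempt in range(max_attempts):
        try:
            return tool()
        except Exception:
            # Sleep only between attempts, not after the final failure.
            if attempt < max_attempts - 1:
                time.sleep(base_delay_s * 2 ** attempt)
    return fallback()


# Hypothetical tool that fails twice before succeeding.
attempts = {"count": 0}


def flaky_lookup() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("upstream timed out")
    return "lookup succeeded"


result = call_with_retries(
    flaky_lookup, lambda: "lookup unavailable", base_delay_s=0.0
)
print(result, attempts["count"])
```

Capping `max_attempts` matters for cost as well as latency: if a failed tool call sends the agent back to the model, each retry can mean another round of tokens.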

Comparison: cost at scale

Here's the real cost picture at three monthly volumes, assuming a typical conversational agent making one retrieval call and one tool call per user message:

| Metric | OpenAI Assistants API | DIY Agent (LangGraph) | noburn.dev (pre-flight enforcement) |
| --- | --- | --- | --- |
| Pricing model | Per token + file storage | Per token only | Per token + per-request cap |
| 10k calls/month | $120–$180 (tokens) + $10–$30 (storage) | $80–$120 | $80–$120 + cost limits enforced |
| 100k calls/month | $1,200–$1,800 + $100–$300 | $800–$1,200 | $800–$1,200 + per-user budget |
| 1M calls/month | $12,000–$18,000 + $1,000–$3,000 | $8,000–$12,000 | $8,000–$12,000 + spend capped before firing |
| Token visibility | Opaque (within assistant) | Full control | Estimated pre-flight |
| Per-user budgets | Manual tracking | Manual tracking | Automatic enforcement |
| Self-hosted option | No | Yes (if using self-hosted LLM) | No |
| Setup overhead | Days | 2–4 weeks | Hours (SDK integration) |

Costs assume GPT-4o, an average of ~2,000 tokens per call, and ~500MB of file storage. Actual numbers depend on your model, message complexity, and retrieval frequency.
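
The token figures in the table can be sanity-checked against the stated average of ~2,000 tokens per call. The sketch below backs out the blended price per million tokens that each column implies. The function name is illustrative, and the resulting rate is a derived figure, not a published price.

```python
TOKENS_PER_CALL = 2_000  # average stated in the cost assumptions


def implied_usd_per_million(monthly_usd: float, calls: int) -> float:
    """Back out the blended $/1M-token rate implied by a monthly token bill."""
    total_tokens = calls * TOKENS_PER_CALL
    return monthly_usd / (total_tokens / 1_000_000)


# Token-cost ranges from the 10k calls/month row of the table.
diy = [implied_usd_per_million(usd, 10_000) for usd in (80, 120)]
assistants = [implied_usd_per_million(usd, 10_000) for usd in (120, 180)]
print(diy)         # [4.0, 6.0]
print(assistants)  # [6.0, 9.0]
```

Under these assumptions the DIY column works out to roughly $4 to $6 per million tokens and the Assistants column to $6 to $9. That 1.5x ratio holds at all three volumes because every row scales linearly with call count.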

The Assistants API wins on engineering overhead but loses on cost visibility and control. DIY wins on cost predictability but requires you to build cost enforcement yourself. Neither prevents runaway spend when a user's query unexpectedly consumes 10x the typical tokens.

The enforcement gap

Here's what both approaches are missing: pre-flight cost blocking. You see the bill after the call fires. By then, either you've exceeded budget, or you've been left guessing whether a query was safe to execute.

The Assistants API compounds this because token usage is hidden inside the assistant—you don't know how many tokens a retrieval will consume until it completes. A file search against 50,000 documents might use 5,000 tokens. A tool call might cascade into three more tool calls. You can set rate limits to slow things down, but that's not cost control; it's just delay.

DIY agents give you more visibility, but you still have to implement budget checks yourself. And even then, most implementations check spend after the call, not before. By the time you know you're over budget, the API call has already fired and the tokens are already consumed.
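
To make "before, not after" concrete, here is a minimal sketch of a pre-flight check for the DIY path. It is a generic illustration, not noburn's implementation or API. The 4-characters-per-token heuristic, the class and function names, and the prices passed in are all assumptions. A real tokenizer would give a tighter estimate.

```python
import math
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before the model call when the worst case would overspend."""


@dataclass
class Budget:
    limit_usd: float
    spent_usd: float = 0.0

    @property
    def remaining_usd(self) -> float:
        return self.limit_usd - self.spent_usd


def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token for English text.
    return math.ceil(len(text) / 4)


def preflight_check(
    prompt: str,
    max_output_tokens: int,
    budget: Budget,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Return the worst-case cost, or raise if it exceeds the remaining budget."""
    input_cost = estimate_tokens(prompt) / 1_000_000 * usd_per_million_input
    output_cost = max_output_tokens / 1_000_000 * usd_per_million_output
    worst_case = input_cost + output_cost
    if worst_case > budget.remaining_usd:
        raise BudgetExceeded(
            f"worst case ${worst_case:.4f} exceeds "
            f"remaining ${budget.remaining_usd:.4f}"
        )
    return worst_case


# Placeholder prices and limits, for illustration only.
budget = Budget(limit_usd=0.01)
prompt = "x" * 4000  # about 1,000 estimated tokens
cost = preflight_check(prompt, 500, budget, 2.50, 10.00)
print(f"approved, worst case ${cost:.4f}")
```

The check bounds output cost by the `max_output_tokens` you would pass to the model, so the estimate is a ceiling rather than a prediction. That is what allows it to run before the request is sent.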

FAQ

Does the Assistants API actually cost more? Not necessarily. At small scale (under 50k calls/month), the difference is noise. But file storage compounds. If you're storing large documents or maintaining a large vector index, storage costs add 15–25% on top of token costs. For a DIY agent using the same models, you'd store vectors in a separate vector DB with its own costs (Pinecone, Weaviate), so it's not a clear win for either approach.

Can I use the Assistants API cost-effectively? Yes, if you're willing to pay more per call in exchange for less engineering time. The key is keeping file storage minimal: archive old files, delete unused ones, and size your vector index carefully. But you're still paying for every token the assistant consumes internally, and you have no way to set per-user spending limits.

Which approach is better for multi-tenant SaaS? DIY agents, because you can implement per-user token budgets and metering yourself. The Assistants API makes per-user cost control cumbersome since file storage is shared at the assistant level, not per user. You'd have to create per-user assistants to isolate costs, which multiplies your storage footprint.
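
As a rough illustration of per-user metering in a DIY agent, the sketch below keeps an in-memory token count per tenant. The class name, the tenant IDs, and the limit are hypothetical, and a production system would persist the counts rather than hold them in a dict.

```python
class UsageLedger:
    """In-memory per-user token meter (a real service would persist this)."""

    def __init__(self, monthly_token_limit: int) -> None:
        self.monthly_token_limit = monthly_token_limit
        self._used: dict[str, int] = {}

    def record(self, user_id: str, prompt_tokens: int, completion_tokens: int) -> None:
        # Feed in the usage figures the model API reports for each response.
        total = prompt_tokens + completion_tokens
        self._used[user_id] = self._used.get(user_id, 0) + total

    def remaining(self, user_id: str) -> int:
        return max(0, self.monthly_token_limit - self._used.get(user_id, 0))

    def is_over_limit(self, user_id: str) -> bool:
        return self._used.get(user_id, 0) >= self.monthly_token_limit


ledger = UsageLedger(monthly_token_limit=10_000)
ledger.record("tenant-a", 1_200, 300)
ledger.record("tenant-b", 9_000, 1_500)
print(ledger.remaining("tenant-a"), ledger.is_over_limit("tenant-b"))
```

Checking `is_over_limit` before dispatching a request gives a coarse per-tenant cap. Since it only sees usage already recorded, a single large call can still overshoot the limit.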

What if I hit an unexpected cost spike? Assistants API: You discover it in your bill. DIY: You discover it in your logs. Neither approach gives you a kill switch at the API request level. That's where pre-flight cost enforcement changes the equation.

Should I migrate from Assistants API to DIY? Only if cost or per-user budgeting is a hard requirement. The Assistants API is simpler to start with and reduces initial engineering risk. DIY is cheaper at scale and gives you full control. The inflection point is usually around 100k calls/month.

Where noburn fits in this stack

noburn.dev addresses the gap both approaches leave open: it estimates token cost before the API call fires and blocks the request if the user or project has exceeded its budget. This works with OpenAI, Anthropic, LiteLLM, LangChain, and LangGraph, so you can layer it on top of whichever orchestration you choose—Assistants API or DIY agent.

For multi-tenant SaaS, noburn adds per-user metering so each customer has their own spending limit. Combine that with Stripe passthrough billing and you can charge customers for their LLM usage, with cost overages automatically capped. The free tier covers 50,000 requests per month. Documentation and SDKs are at noburn.dev/docs.