The calculator separates costs into two fundamentally different categories. This distinction matters because one type is calculated from published provider rates, and the other depends on your own infrastructure choices.
API / Usage costs
Calculated · Verifiable
Charged by AI providers per token, per API call, or per page. Rates come directly from provider pricing pages. You can verify them independently. Includes: LLM inference, search tool calls, embedding generation, OCR.
Operating overhead
Your estimates · Not calculated
Costs you supply based on your own infrastructure. The calculator cannot calculate these for you — they depend on your hosting, database, and vendor choices. Includes: server hosting, vector database, monitoring tools.
Everything in the calculator builds on three foundational ideas. Understanding how they nest inside each other is the key to reading the results accurately.
Token
The unit AI providers use to measure text. Roughly ¾ of a word in English — "ModelPricing" is 3 tokens, a 1,000-word article is roughly 1,300 tokens. Providers charge separately for input tokens (text sent to the model) and output tokens (text the model generates). Output tokens cost significantly more because they require more computation.
Workflow
One complete task a user triggers — e.g. "analyse this contract", "book an appointment", "qualify this lead". A workflow might involve multiple model calls, tool lookups, and document reads. It is the natural unit for measuring usage volume: how many tasks does your system run per month?
Model call
A single API request to the language model. One workflow often requires several calls — for example: extract data → reason about it → generate a reply → check the output. Each call consumes input and output tokens and is billed separately. Multi-step agents typically have 2–5 calls per workflow.
Retry overhead
Not all calls succeed first time. Some fail due to rate limits, JSON parse errors, or model refusals — triggering an automatic retry. Retry overhead % adds a fraction of extra calls on top of your expected call count. A 5% retry rate means 5 extra calls per 100 attempts.
These nest like this: a workflow triggers one or more model calls; each model call consumes input and output tokens; and retry overhead adds a fraction of extra calls on top of the expected count.
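As a rough sketch of how those units combine into a billable call count (function and parameter names here are illustrative, not the calculator's internals):

```typescript
// Total billable model calls per month, combining the units above.
function monthlyModelCalls(
  workflowsPerMonth: number, // e.g. 10,000 tasks per month
  callsPerWorkflow: number,  // multi-step agents: typically 2-5
  retryOverheadPct: number,  // e.g. 5 means 5 extra calls per 100 attempts
): number {
  const baseCalls = workflowsPerMonth * callsPerWorkflow;
  return baseCalls * (1 + retryOverheadPct / 100);
}

// 10,000 workflows x 3 calls x 1.05 retry factor = 31,500 calls/month
console.log(monthlyModelCalls(10_000, 3, 5)); // 31500
```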
This is the core charge — what you pay the AI provider for each model call. Providers charge per million tokens, with separate rates for input and output. The full formula, before cache and batch adjustments:

cost per call = (input tokens ÷ 1,000,000 × input rate) + (output tokens ÷ 1,000,000 × output rate)
Rates are in USD per 1 million tokens. For example, Claude Sonnet 4.6 charges $3 per million input tokens and $15 per million output tokens. A call with 2,000 input tokens and 500 output tokens costs: (2,000/1M × $3) + (500/1M × $15) = $0.006 + $0.0075 = $0.0135.
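The same arithmetic as a small function, reproducing the worked example (names are illustrative):

```typescript
// Cost of a single model call at published per-million-token rates.
function callCost(
  inputTokens: number,
  outputTokens: number,
  inputRatePerM: number,  // USD per 1M input tokens
  outputRatePerM: number, // USD per 1M output tokens
): number {
  return (inputTokens / 1e6) * inputRatePerM +
         (outputTokens / 1e6) * outputRatePerM;
}

// The worked example above: 2,000 in + 500 out at $3 / $15 per million
console.log(callCost(2_000, 500, 3, 15)); // ~0.0135
```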
Two provider features can significantly reduce LLM costs. The calculator applies both as multipliers on top of the base inference cost.
Cache hit rate %
When the same system prompt or context appears in repeated calls, providers can cache the input and charge a fraction of the normal input rate (typically 10–20× cheaper). The cache hit rate is the percentage of your input tokens that get served from cache rather than reprocessed. A customer support agent with a fixed knowledge base might see 30–50% cache hits; a general assistant with unique per-user context might see 5–15%.
Batch share %
Providers offer a batch API for non-real-time work — jobs submitted in bulk and completed within hours rather than seconds. Batch pricing typically gives a 50% discount. The batch share is the percentage of your total calls that can tolerate this delay. Report generation, nightly summarisation, and data enrichment pipelines are good candidates; live chat is not.
Batch discount rate %
The discount applied to batch calls. Most providers offer 50%. This is pre-filled from the model preset but editable if your agreement differs. The discount applies to the entire call cost (input + output), not just one tier.
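Putting the two adjustments together, a plausible sketch looks like this. The 0.1 cached-rate factor is an assumption drawn from the "10–20× cheaper" range above, and the multiplier order is one reasonable reading, not the calculator's published internals:

```typescript
// Apply cache and batch multipliers on top of the base inference cost.
function adjustedInferenceCost(params: {
  inputTokens: number;
  outputTokens: number;
  inputRatePerM: number;
  outputRatePerM: number;
  cacheHitPct: number;       // share of input tokens served from cache
  batchSharePct: number;     // share of calls routed to the batch API
  batchDiscountPct: number;  // typically 50
  cachedRateFactor?: number; // ASSUMPTION: cached input at 10% of full rate
}): number {
  const f = params.cachedRateFactor ?? 0.1;
  const cacheHit = params.cacheHitPct / 100;
  // Input: cached share billed at the reduced rate, the rest at full rate.
  const inputCost =
    (params.inputTokens / 1e6) * params.inputRatePerM *
    ((1 - cacheHit) + cacheHit * f);
  const outputCost = (params.outputTokens / 1e6) * params.outputRatePerM;
  // Batch discount applies to the whole call cost (input + output).
  const batchShare = params.batchSharePct / 100;
  const batchFactor =
    (1 - batchShare) + batchShare * (1 - params.batchDiscountPct / 100);
  return (inputCost + outputCost) * batchFactor;
}
```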
Many AI agents call external search APIs (Brave, Tavily, Bing) to retrieve up-to-date information. This creates two separate charges: one for the search call itself, and one for the retrieved content that gets injected into the model's context.
Web searches per workflow
The average number of search API calls made per workflow. A simple chatbot that sometimes looks things up might average 0.3. A research agent that always searches might average 3–5. Set to 0 if your system does not use web search.
Search content tokens
After a search call returns results, those results are injected into the model's context as additional input tokens. Search content tokens is the average number of tokens retrieved per search call. Search results are typically 1,000–4,000 tokens per call. These are billed at your model's input rate, because they enter the model as input.
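A sketch of the per-workflow search cost under these definitions; the per-search price is hypothetical, so substitute your search provider's actual rate:

```typescript
// Per-workflow web search cost: the search API fee plus the retrieved
// content billed as extra input tokens at the model's input rate.
function searchCostPerWorkflow(
  searchesPerWorkflow: number, // e.g. 0.3 for an occasional lookup
  searchContentTokens: number, // e.g. 1,000-4,000 tokens per call
  inputRatePerM: number,       // your model's input rate, USD per 1M
  pricePerSearch: number,      // HYPOTHETICAL, e.g. $0.005 per call
): number {
  const apiFees = searchesPerWorkflow * pricePerSearch;
  const contextCost =
    searchesPerWorkflow * (searchContentTokens / 1e6) * inputRatePerM;
  return apiFees + contextCost;
}
```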
Embedding
A numerical representation of text — a list of numbers (a "vector") that captures its meaning. Embeddings are used to power semantic search and RAG (retrieval-augmented generation) — finding relevant documents from a knowledge base based on meaning, not keyword matching. Embedding models are separate from language models and typically much cheaper.
Monthly embedding tokens
The total number of tokens you embed each month — both for indexing your knowledge base and for embedding user queries at runtime. A 10,000-document knowledge base with 500 tokens per doc = 5M tokens to index (done once). Embedding each user query might add another 100 tokens × number of queries per month.
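A minimal sketch of the token arithmetic; the query volume is an illustrative assumption, and cost is then tokens ÷ 1M × your embedding model's rate:

```typescript
// Monthly embedding tokens: one-off indexing plus per-query embeddings.
function monthlyEmbeddingTokens(
  docs: number,           // knowledge-base documents to index
  tokensPerDoc: number,   // e.g. 500
  queriesPerMonth: number, // ASSUMPTION: pick your own volume
  tokensPerQuery: number, // e.g. ~100
): number {
  return docs * tokensPerDoc + queriesPerMonth * tokensPerQuery;
}

// 10,000 docs x 500 tokens + 50,000 queries x 100 tokens = 10M tokens
console.log(monthlyEmbeddingTokens(10_000, 500, 50_000, 100)); // 10000000
```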
OCR — Optical Character Recognition
Converting images or scanned PDFs into machine-readable text. AI document pipelines (legal, finance, healthcare) often need to extract text from uploaded PDFs before the language model can read them. OCR APIs (e.g. AWS Textract, Google Document AI) charge per page processed. Set to 0 if your system only handles plain text or already-digital documents.
OCR pages per workflow
The average number of document pages processed in a single workflow. A contract review agent might process 8–15 pages per run. A simple form processor might handle 1–2 pages. The total monthly OCR pages = workflows × pages per workflow.
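The same multiplication as code; the per-page price is hypothetical, so check your OCR provider's published rate:

```typescript
// Monthly OCR cost: total pages processed x per-page rate.
function monthlyOcrCost(
  workflowsPerMonth: number,
  pagesPerWorkflow: number, // e.g. 8-15 for contract review
  pricePerPage: number,     // HYPOTHETICAL, e.g. $0.0015 per page
): number {
  return workflowsPerMonth * pagesPerWorkflow * pricePerPage;
}
```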
These are the infrastructure costs that keep your system running — independent of how many workflows you process. Because they depend entirely on your vendor choices, the calculator takes your estimates directly rather than computing them.
Fixed infrastructure cost
Your monthly server, hosting, and platform costs — regardless of traffic. This includes: compute (EC2, Cloud Run, Fly.io), API gateway, monitoring (Datadog, Sentry), CI/CD pipelines, and any third-party orchestration services (LangSmith, Helicone). Does not scale with usage.
Vector storage cost
If your system uses a vector database for semantic search (Pinecone, Weaviate, Qdrant, pgvector), it has a monthly storage cost. This is typically a flat rate at low scale and grows with the size of your index. Included as a separate field because it is often the second-largest infrastructure line item in RAG systems.
Contingency buffer %
A safety margin applied to the entire cost estimate. Real usage rarely matches projections exactly — token counts vary, traffic spikes, caching behaves differently than expected. A 10–15% contingency is standard for planning purposes. It does not represent a real charge — it is a budget cushion.
The full cost calculation stacks each component in order:

1. LLM inference, adjusted for cache hits and batch share
2. Web search calls plus retrieved content tokens
3. Embedding generation
4. OCR page processing (items 1–4 form the API / usage cost)
5. Fixed infrastructure and vector storage (the operating overhead)
6. Contingency buffer %, applied on top of the combined total
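One plausible way to assemble that stack in code, assuming each component has already been computed per the sections above (field names are illustrative):

```typescript
interface MonthlyCosts {
  llmInference: number;   // after cache + batch adjustments
  webSearch: number;      // API fees + retrieved content tokens
  embeddings: number;
  ocr: number;
  fixedInfra: number;     // hosting, monitoring, orchestration
  vectorStorage: number;
  contingencyPct: number; // e.g. 10-15
}

// Planning total, contingency included. Profit and margin figures use
// the P&L total (apiCost + overhead) without the contingency buffer.
function totalEstimate(c: MonthlyCosts): number {
  const apiCost = c.llmInference + c.webSearch + c.embeddings + c.ocr;
  const overhead = c.fixedInfra + c.vectorStorage;
  return (apiCost + overhead) * (1 + c.contingencyPct / 100);
}
```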
A single number gives false precision. The calculator runs three scenarios using multipliers on your volume inputs — rates stay fixed. These are labelled P10 / P50 / P90 to indicate approximate percentile positions. The distribution is right-skewed: P90 is further from the median than P10, reflecting the real-world tendency for AI costs to blow out more than they collapse. P50 is the median estimate, not a probability-weighted mean.
| Scenario | Workflow volume | Token counts | Search calls | Retry rate | Fixed costs |
|---|---|---|---|---|---|
| P10 (low) | 70% of P50 | 85% of P50 | 70% of P50 | 50% of P50 | 90% of P50 |
| P50 (median) | 100% | 100% | 100% | 100% | 100% |
| P90 (heavy) | 150% of P50 | 120% of P50 | 140% of P50 | 200% of P50 | 110% of P50 |
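Expressed as code, the table reduces to a small set of volume multipliers applied to your P50 inputs (a sketch; the structure is illustrative):

```typescript
// Scenario multipliers from the table above. They scale volume inputs
// only; published provider rates stay fixed across scenarios.
const SCENARIOS = {
  p10: { workflows: 0.70, tokens: 0.85, searches: 0.70, retry: 0.50, fixed: 0.90 },
  p50: { workflows: 1.00, tokens: 1.00, searches: 1.00, retry: 1.00, fixed: 1.00 },
  p90: { workflows: 1.50, tokens: 1.20, searches: 1.40, retry: 2.00, fixed: 1.10 },
} as const;
```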
The bar chart shows all three side by side. For budget approval, use the P90 figure — it covers a high-usage month. For unit economics and pricing decisions, use P50.
API cost / month
What you would owe AI providers at P50 volume. Calculated from published rates × your volume estimates. The rates are sourced from provider pricing pages; the volumes (tokens per call, retry %, cache-hit %) are your estimates and carry the same uncertainty as operating overhead inputs.
API cost per workflow
API cost divided by total monthly workflows. The most useful unit for product decisions — tells you what one task costs in AI charges before overhead. Use this to compare models or price individual features.
Operating overhead / month
Your fixed infrastructure costs (infra + vector storage). Based on your own estimates — the calculator does not compute these. Does not include contingency, which is shown separately in the budget view.
Total estimate / month
Planning total: API cost + operating overhead + contingency buffer. Used for budget approval. Profit, gross margin, and break-even use the P&L total (without contingency) — because contingency is a planning reserve, not a realized expense.
Cost per active user
(API cost + operating overhead) ÷ monthly active users. Excludes contingency. This is your unit economics baseline — the minimum revenue per user needed for the product to cover its direct costs.
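The division as a small function, excluding contingency as described (names illustrative):

```typescript
// Unit-economics baseline: direct cost per monthly active user.
function costPerActiveUser(
  apiCostPerMonth: number,
  operatingOverheadPerMonth: number,
  monthlyActiveUsers: number,
): number {
  return (apiCostPerMonth + operatingOverheadPerMonth) / monthlyActiveUsers;
}
```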
Price for target gross margin
Uses only variable API costs per user (the correct COGS for gross margin). Formula: variable_cost_per_user / (1 − target_GM). This matches how gross margin is defined in a standard P&L — revenue minus direct cost of revenue, divided by revenue. It excludes fixed infra, monitoring, and contingency, which belong below the gross margin line. For a full operating-margin price, add fixed cost allocation on top.
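A sketch of that formula with a worked example (numbers illustrative):

```typescript
// Price needed to hit a target gross margin, using variable API cost
// per user as COGS, per the definition above.
function priceForTargetMargin(
  variableCostPerUser: number,  // API cost only, excludes fixed infra
  targetGrossMarginPct: number, // e.g. 70
): number {
  return variableCostPerUser / (1 - targetGrossMarginPct / 100);
}

// $2.40 of API cost per user at a 70% target margin -> $8.00 price
console.log(priceForTargetMargin(2.4, 70)); // ~8
```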
Break-even users (CVP)
Uses the standard cost-volume-profit formula: fixed costs ÷ (revenue per user − variable cost per user). This is the volume at which contribution margin covers fixed costs. If variable cost per user exceeds revenue per user, there is no break-even — the calculator surfaces this as a warning. Note: excludes contingency from both fixed and variable costs.
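A sketch of the CVP calculation, including the no-break-even case mentioned above (names illustrative):

```typescript
// Standard CVP break-even: fixed costs / contribution margin per user.
// Returns null when contribution margin is zero or negative, meaning
// there is no break-even volume (the calculator surfaces a warning).
function breakEvenUsers(
  fixedCostsPerMonth: number,  // infra + vector storage, no contingency
  revenuePerUser: number,
  variableCostPerUser: number, // API cost per user, no contingency
): number | null {
  const contribution = revenuePerUser - variableCostPerUser;
  if (contribution <= 0) return null;
  return Math.ceil(fixedCostsPerMonth / contribution);
}
```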
Gross margin %
Only shown when you enter a revenue per user. (revenue − API costs) / revenue × 100. This is true gross margin — revenue minus variable COGS only. Fixed infra and contingency sit below the gross margin line. A SaaS business typically targets 60–80% gross margin on this basis.
Yearly run-rate
Planning total × 12. This is a flat run-rate, not a projection. It does not account for provider price changes (historically 20–40% annual decline), user growth, or discounting. Use it as a rough annual budget ceiling, not a financial forecast.
All prices are sourced directly from official provider documentation and pricing pages. We do not use third-party aggregators as a primary source.
| Provider | Source |
|---|---|
| OpenAI | openai.com/api/pricing |
| Anthropic | anthropic.com/pricing |
| Google Vertex AI | cloud.google.com/vertex-ai/pricing |
| AWS Bedrock | aws.amazon.com/bedrock/pricing |
To allow fair comparison across providers, all prices are standardised to USD per 1 million tokens, with input and output rates listed separately.
Prices are reviewed manually against official documentation. We aim to reflect provider price changes within 7 days of announcement. Major changes (new model releases, significant price drops) are typically reflected within 48 hours.
The calculator shows a "Prices last updated" timestamp in the pricing overrides section. The models comparison page subtitle indicates the month of the most recent full review.
ModelPricing is an independent resource. Editorial decisions — which models to include, how to describe them — are not influenced by providers or affiliate relationships. Affiliate partnerships cover tool recommendations only, not model pricing data.
Questions or corrections? contact@elythra.studio — or open the calculator and start with one of the presets to see the numbers in action.