LLM API Pricing Explained: Tokens, Context, Caching, and Hidden Costs
LLM API pricing can look simple at first: pay for input tokens and output tokens. In production, the real cost is more complicated.
Your bill depends on prompt length, output length, model choice, retries, caching, embeddings, failed requests, long context, and how your application routes traffic.
This guide explains the main cost drivers so you can estimate and control LLM API spend.
Input tokens
Input tokens are the text you send to the model:
- system prompt
- user message
- conversation history
- retrieved documents
- tool definitions
- examples
- hidden instructions
Long prompts increase cost and latency. Review input size regularly.
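A minimal sketch of what "input size" really includes is below. The 4-characters-per-token ratio and the price are illustrative assumptions, not real provider numbers; use your provider's tokenizer and price sheet for actual estimates.

```python
# Sketch: estimate input cost from every part of a request, not just the user message.
# The 4-chars-per-token heuristic and the price below are illustrative assumptions.

ASSUMED_INPUT_PRICE_PER_1K = 0.003  # hypothetical $ per 1K input tokens

def rough_tokens(text: str) -> int:
    """Very rough token estimate; use your provider's tokenizer for real numbers."""
    return max(1, len(text) // 4)

def estimate_input_cost(system_prompt: str, history: list[str],
                        retrieved_docs: list[str], tool_definitions: str,
                        user_message: str) -> float:
    parts = [system_prompt, user_message, tool_definitions, *history, *retrieved_docs]
    total_tokens = sum(rough_tokens(p) for p in parts)
    return total_tokens / 1000 * ASSUMED_INPUT_PRICE_PER_1K

# Example: a "short" question can still carry thousands of hidden input tokens.
cost = estimate_input_cost(
    system_prompt="You are a support assistant. " * 50,
    history=["previous turn " * 200] * 5,
    retrieved_docs=["document chunk " * 300] * 4,
    tool_definitions="{...tool schema...} " * 20,
    user_message="Where is my order?",
)
print(f"Estimated input cost: ${cost:.4f}")
```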
Output tokens
Output tokens are the tokens in the model's response. With most providers, output tokens are priced higher per token than input tokens, often several times higher.
Control output cost with:
- concise instructions
- maximum token limits
- structured formats
- shorter UI requirements
- summaries instead of full prose
Do not generate text your product will not use.
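The sketch below shows why a maximum token limit matters: same prompt, different output caps, very different cost. Prices and token counts are illustrative assumptions.

```python
# Sketch: output tokens often dominate per-request cost, so cap them explicitly.
# Prices and token counts are illustrative assumptions, not real provider rates.

ASSUMED_INPUT_PRICE_PER_1K = 0.003   # hypothetical
ASSUMED_OUTPUT_PRICE_PER_1K = 0.015  # hypothetical; output is usually priced higher

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * ASSUMED_INPUT_PRICE_PER_1K
            + output_tokens / 1000 * ASSUMED_OUTPUT_PRICE_PER_1K)

# Same prompt, uncapped vs. capped output (e.g. a max_tokens setting of 300).
uncapped = request_cost(input_tokens=1_500, output_tokens=1_200)
capped = request_cost(input_tokens=1_500, output_tokens=300)

print(f"Uncapped response: ${uncapped:.4f}")
print(f"Capped response:   ${capped:.4f}")
```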
Context windows
A larger context window lets you send more text, but it does not make that text free.
Long-context requests can become expensive because they include many input tokens. Use long context when the task needs it, not as a default.
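A quick back-of-the-envelope sketch makes the point. The price and traffic figures are illustrative assumptions; plug in your own numbers.

```python
# Sketch: a large context window lets you send 100K+ tokens, but every request pays for them.
# Price and traffic figures below are illustrative assumptions.

ASSUMED_INPUT_PRICE_PER_1K = 0.003  # hypothetical $ per 1K input tokens

def daily_input_cost(tokens_per_request: int, requests_per_day: int) -> float:
    return tokens_per_request / 1000 * ASSUMED_INPUT_PRICE_PER_1K * requests_per_day

# Defaulting to "send everything" vs. sending only what the task needs.
print(f"100K-token prompts: ${daily_input_cost(100_000, 10_000):,.0f}/day")
print(f"  8K-token prompts: ${daily_input_cost(8_000, 10_000):,.0f}/day")
```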
Caching
Some providers support prompt or context caching. Caching can reduce cost when the same prefix or document context is reused.
Good caching candidates:
- static system prompts
- common instructions
- unchanged documents
- repeated templates
- shared knowledge-base context
Caching rules vary by provider, so measure actual savings.
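As a rough model, assume a stable prefix (system prompt plus shared documents) billed at a discounted cached rate. The discount and prices below are illustrative assumptions; real providers may also charge for cache writes and expire caches quickly, which is why measuring actual savings matters.

```python
# Sketch: estimate savings when a stable prefix is cached across many requests.
# The cached-token discount and prices are illustrative assumptions.

ASSUMED_INPUT_PRICE_PER_1K = 0.003    # hypothetical full price per 1K input tokens
ASSUMED_CACHED_PRICE_PER_1K = 0.0003  # hypothetical price per 1K cached-prefix tokens

def input_cost(prefix_tokens: int, variable_tokens: int, cached: bool) -> float:
    prefix_price = ASSUMED_CACHED_PRICE_PER_1K if cached else ASSUMED_INPUT_PRICE_PER_1K
    return (prefix_tokens / 1000 * prefix_price
            + variable_tokens / 1000 * ASSUMED_INPUT_PRICE_PER_1K)

# 20K-token shared prefix, 500-token user-specific suffix, 50K requests.
requests = 50_000
without_cache = input_cost(20_000, 500, cached=False) * requests
with_cache = input_cost(20_000, 500, cached=True) * requests
print(f"Without caching: ${without_cache:,.0f}")
print(f"With caching:    ${with_cache:,.0f}")
```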
Retries
Retries are easy to forget. If a request fails twice before succeeding, you may pay for three calls instead of one.
Track retry cost separately and avoid retrying deterministic errors such as invalid parameters or prompts that exceed context limits.
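One way to enforce this is a wrapper that retries only transient failures and tags each attempt so retry cost can be reported separately. This is a sketch under assumptions: the error classes and backoff policy are placeholders, and you would map them to your provider's actual error types.

```python
import time

# Sketch: retry only transient failures, and tag attempts so retry cost is visible.
# The error classification below is an assumption; adapt it to your provider's errors.

TRANSIENT_ERRORS = (TimeoutError, ConnectionError)  # retry these
# Deterministic errors (invalid parameters, context-length exceeded) should not be retried.

def call_with_retries(call, max_attempts=3, cost_log=None):
    for attempt in range(1, max_attempts + 1):
        try:
            result = call()
            if cost_log is not None:
                cost_log.append({"attempt": attempt, "status": "success"})
            return result
        except TRANSIENT_ERRORS:
            if cost_log is not None:
                cost_log.append({"attempt": attempt, "status": "retried"})
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # simple backoff between attempts
        # Any other exception (e.g. a validation error) propagates immediately: no retry.
```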
Embeddings and RAG
RAG systems add costs beyond generation:
- embedding documents
- embedding queries
- vector database storage
- reranking
- longer prompts with retrieved context
RAG can reduce generation cost by sending less context, but poorly tuned RAG can increase cost.
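The sketch below compares a lean retrieval setup against one that stuffs in far more context than the task needs. All prices are illustrative assumptions, and one-time document embedding and vector storage are treated as amortized elsewhere.

```python
# Sketch: per-request RAG cost = query embedding + retrieved context in the prompt + generation.
# All prices are illustrative assumptions, not real provider rates.

ASSUMED_EMBED_PRICE_PER_1K = 0.0001
ASSUMED_INPUT_PRICE_PER_1K = 0.003
ASSUMED_OUTPUT_PRICE_PER_1K = 0.015

def rag_request_cost(query_tokens: int, retrieved_tokens: int,
                     prompt_tokens: int, output_tokens: int) -> float:
    embedding = query_tokens / 1000 * ASSUMED_EMBED_PRICE_PER_1K
    generation_input = (prompt_tokens + retrieved_tokens) / 1000 * ASSUMED_INPUT_PRICE_PER_1K
    generation_output = output_tokens / 1000 * ASSUMED_OUTPUT_PRICE_PER_1K
    return embedding + generation_input + generation_output

# Well-tuned retrieval (a few small chunks) vs. "retrieve everything" (many large chunks).
print(f"Lean RAG:    ${rag_request_cost(30, 1_500, 800, 300):.4f}")
print(f"Bloated RAG: ${rag_request_cost(30, 10_000, 800, 300):.4f}")
```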
Routing and model mix
Your average cost depends on model mix. A product that sends every request to a premium model will have a very different cost profile from one that routes simple tasks to cheaper models.
Track cost by:
- feature
- model
- provider
- customer
- plan
- request type
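A minimal way to get this visibility is to tag every request with those dimensions at the point where you compute its cost, then aggregate. The field names, model names, and prices below are illustrative assumptions; in practice you would send these records to a metrics or analytics pipeline rather than an in-memory dictionary.

```python
from collections import defaultdict

# Sketch: tag each request with the dimensions you want to slice cost by, then aggregate.
# Model names and prices are illustrative assumptions.

ASSUMED_PRICES_PER_1K = {            # hypothetical (input, output) prices by model tier
    "cheap-model": (0.0005, 0.0015),
    "premium-model": (0.003, 0.015),
}

def record_cost(ledger, *, feature, model, provider, customer, plan, request_type,
                input_tokens, output_tokens):
    in_price, out_price = ASSUMED_PRICES_PER_1K[model]
    cost = input_tokens / 1000 * in_price + output_tokens / 1000 * out_price
    ledger[(feature, model, provider, customer, plan, request_type)] += cost
    return cost

ledger = defaultdict(float)
record_cost(ledger, feature="chat", model="premium-model", provider="provider-a",
            customer="acme", plan="pro", request_type="interactive",
            input_tokens=2_000, output_tokens=600)
record_cost(ledger, feature="autotag", model="cheap-model", provider="provider-a",
            customer="acme", plan="pro", request_type="batch",
            input_tokens=1_000, output_tokens=100)

for key, cost in ledger.items():
    print(key, f"${cost:.4f}")
```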
Hidden operational costs
Also consider:
- engineering time for provider integrations
- monitoring and logging
- quality evaluation
- incident response
- customer support load caused by bad answers
- compliance review
- data retention requirements
The cheapest API price is not always the cheapest production system.
Final thoughts
LLM API pricing is a system-level problem. Tokens matter, but so do context size, retries, caching, embeddings, routing, and observability.
To control cost, measure usage at the request level, route by workload, limit unnecessary context, and review model mix regularly.