RAG Performance and Cost: How to Make Retrieval-Augmented Generation Cheaper and Better
Retrieval-augmented generation, or RAG, is one of the most common patterns in AI products. It lets an LLM answer questions using your documents, knowledge base, tickets, policies, or product data.
But many RAG systems become expensive and unreliable because they send too much irrelevant context to the model.
Better retrieval usually improves both quality and cost.
Why RAG gets expensive
RAG cost often grows because of:
- large chunks
- too many retrieved documents
- duplicated context
- weak metadata filtering
- no reranking
- long system prompts
- unnecessary citations
- retries after low-quality answers
Every irrelevant token increases cost and can distract the model.
Start with chunking
Chunking determines what the retriever can find. If chunks are too large, you pay to send unnecessary text; if they are too small, the model may miss the surrounding context it needs.
Good chunking depends on document type:
| Content type | Chunking approach |
|---|---|
| Help docs | Section-based chunks |
| Legal documents | Clause or heading-based chunks |
| Code docs | Function or page-based chunks |
| Support tickets | Conversation or issue-based chunks |
| PDFs | Heading-aware chunks when possible |

Avoid arbitrary fixed-size chunks when document structure is available.
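As a concrete example, here is a minimal heading-aware chunker for markdown. The regex and the `max_chars` merge threshold are illustrative choices, not recommendations; a real splitter would also handle code blocks, tables, and oversized sections.

```python
import re

def chunk_by_headings(markdown_text: str, max_chars: int = 2000) -> list[str]:
    """Split a markdown document at headings, merging short sections.

    max_chars is an illustrative budget, not a tuned recommendation.
    """
    # Split at line starts that begin a heading, keeping the heading
    # with the text that follows it (zero-width lookahead split).
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown_text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Merge a short section into the previous chunk to avoid fragments.
        if chunks and len(chunks[-1]) + len(section) < max_chars:
            chunks[-1] += "\n\n" + section
        else:
            chunks.append(section)
    return chunks
```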
Use metadata filters
Metadata filters narrow the candidate set before or during vector search, so fewer irrelevant chunks ever reach the model.
Useful filters include:
- product
- language
- customer
- document type
- region
- permission level
- date
- version
Metadata is often the cheapest way to improve RAG quality.
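As a sketch of the idea, the snippet below filters hypothetical chunk records in plain Python. The field names are made up for illustration; in practice you would push these filters into the vector store query itself rather than filtering after retrieval.

```python
from typing import Any

def metadata_filter(chunks: list[dict[str, Any]], **required: Any) -> list[dict[str, Any]]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.get("metadata", {}).get(k) == v for k, v in required.items())
    ]

# Hypothetical chunk records for illustration.
chunks = [
    {"text": "Refund policy v2...",
     "metadata": {"product": "billing", "language": "en", "version": "v2"}},
    {"text": "Politique de remboursement...",
     "metadata": {"product": "billing", "language": "fr", "version": "v2"}},
]

candidates = metadata_filter(chunks, product="billing", language="en")
```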
Add reranking
Vector search can retrieve text that is semantically similar to the query but not actually useful for answering it. A reranker rescores the retrieved candidates against the query and keeps only the most relevant ones.
Reranking is especially helpful when:
- documents are long
- queries are ambiguous
- many chunks are similar
- answers require exact policy details
- top-k retrieval returns noisy results
The added reranking cost can be worth it if it reduces prompt size and retries.
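One common approach is a cross-encoder reranker. The sketch below uses the open-source sentence-transformers library; the specific model name is just one popular choice, and a hosted reranking API would slot in the same way.

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    """Rescore retrieved passages against the query and keep the best top_n."""
    # One commonly used cross-encoder checkpoint; swap in whatever
    # reranker fits your stack. Load it once at startup in real code.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```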
Control top-k
Sending the top 20 chunks to an LLM is rarely efficient. Test smaller values such as 3, 5, or 8.
Measure:
- answer accuracy
- token cost
- latency
- citation quality
- user satisfaction
More context is not always better.
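A simple way to run that comparison is a top-k sweep over a labeled eval set. In the sketch below, `retrieve`, `answer`, and `grade` are hypothetical stand-ins for your own retrieval, generation, and scoring functions, and the eval set is your own labeled queries.

```python
def sweep_top_k(eval_set, retrieve, answer, grade, k_values=(3, 5, 8, 20)):
    """Compare answer accuracy and token cost across top-k settings.

    eval_set: list of (query, expected_answer) pairs.
    retrieve(query, top_k): returns a list of context chunks.
    answer(query, context): returns {"text": ..., "usage": {"total_tokens": ...}}.
    grade(text, expected): returns 1 if the answer is acceptable, else 0.
    """
    results = {}
    for k in k_values:
        correct, tokens = 0, 0
        for query, expected in eval_set:
            context = retrieve(query, top_k=k)
            response = answer(query, context)
            correct += grade(response["text"], expected)
            tokens += response["usage"]["total_tokens"]
        results[k] = {
            "accuracy": correct / len(eval_set),
            "avg_tokens": tokens / len(eval_set),
        }
    return results
```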
Route RAG tasks by complexity
Not every RAG query needs the same model.
Examples:
- simple FAQ lookup: small fast model
- policy interpretation: stronger reasoning model
- long document synthesis: long-context model
- extraction from retrieved context: structured-output model
Routing RAG requests by complexity can reduce spend while preserving quality.
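A router can be as simple as a lookup table plus a cheap classification step. The sketch below uses a deliberately crude keyword heuristic and placeholder model names; production routers typically use a trained classifier or a small LLM call to pick the route.

```python
# Placeholder model names; substitute the models available in your stack.
ROUTES = {
    "faq": "small-fast-model",
    "policy": "strong-reasoning-model",
    "synthesis": "long-context-model",
    "extraction": "structured-output-model",
}

def route(query: str, context_chars: int) -> str:
    """Pick a model route from crude query and context signals."""
    q = query.lower()
    if context_chars > 50_000:
        return ROUTES["synthesis"]
    if any(w in q for w in ("extract", "list all", "fields")):
        return ROUTES["extraction"]
    if any(w in q for w in ("policy", "allowed", "compliance")):
        return ROUTES["policy"]
    return ROUTES["faq"]
```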
Log retrieval data
For each RAG answer, log:
- query
- retrieved document IDs
- chunk IDs
- similarity scores
- reranking scores
- final context length
- model used
- token usage
- answer status
This is essential for debugging bad answers.
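A minimal version is one JSON line per answer. The field names below are illustrative, assuming your retriever returns per-chunk IDs and scores.

```python
import json
import time
import uuid

def log_rag_answer(query, retrieval, model, usage, status, path="rag_log.jsonl"):
    """Append one structured record per RAG answer to a JSONL file.

    retrieval: list of dicts with illustrative keys
    doc_id, chunk_id, score, rerank_score, text.
    """
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "doc_ids": [r["doc_id"] for r in retrieval],
        "chunk_ids": [r["chunk_id"] for r in retrieval],
        "similarity_scores": [r["score"] for r in retrieval],
        "rerank_scores": [r.get("rerank_score") for r in retrieval],
        "context_chars": sum(len(r["text"]) for r in retrieval),
        "model": model,
        "token_usage": usage,
        "status": status,  # e.g. "ok", "no_answer", "retried"
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```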
Final thoughts
RAG quality is not only a model problem. It is a retrieval problem, a context problem, and a cost problem.
Before upgrading to a more expensive model, improve chunking, metadata filters, reranking, top-k limits, and routing. The result is often cheaper and better.