RAG Performance and Cost: How to Make Retrieval-Augmented Generation Cheaper and Better
Retrieval-augmented generation, or RAG, is one of the most common patterns in AI products. It lets an LLM answer questions using your documents, knowledge base, tickets, policies, or product data.
But many RAG systems become expensive and unreliable because they send too much irrelevant context to the model.
Better retrieval usually improves both quality and cost.
Why RAG gets expensive
RAG cost often grows because of:
- large chunks
- too many retrieved documents
- duplicated context
- weak metadata filtering
- no reranking
- long system prompts
- unnecessary citations
- retries after low-quality answers
Every irrelevant token increases cost and can distract the model.
Start with chunking
Chunking determines what the retriever can find. If chunks are too large, you pay to send unnecessary text; if they are too small, the model may miss the surrounding context it needs.
Good chunking depends on document type:
| Content type | Chunking approach |
|---|---|
| Help docs | Section-based chunks |
| Legal documents | Clause or heading-based chunks |
| Code docs | Function or page-based chunks |
| Support tickets | Conversation or issue-based chunks |
| PDFs | Heading-aware chunks when possible |

Avoid arbitrary fixed-size chunks when document structure is available.
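As a concrete example, here is a minimal heading-aware chunker for markdown. The regex and the `max_chars` merge threshold are illustrative choices, not recommendations; a real splitter would also handle code blocks, tables, and oversized sections.

```python
import re

def chunk_by_headings(markdown_text: str, max_chars: int = 2000) -> list[str]:
    """Split a markdown document at headings, merging short sections.

    max_chars is an illustrative budget, not a tuned recommendation.
    """
    # Split at line starts that begin a heading, keeping the heading
    # with the text that follows it (zero-width lookahead split).
    sections = re.split(r"(?m)^(?=#{1,6}\s)", markdown_text)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Merge a short section into the previous chunk to avoid fragments.
        if chunks and len(chunks[-1]) + len(section) < max_chars:
            chunks[-1] += "\n\n" + section
        else:
            chunks.append(section)
    return chunks
```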
Use metadata filters
Metadata filters narrow the candidate set before or during vector search, so fewer irrelevant chunks ever reach the model.
Useful filters include:
- product
- language
- customer
- document type
- region
- permission level
- date
- version
Metadata is often the cheapest way to improve RAG quality.
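As a sketch of the idea, the snippet below filters hypothetical chunk records in plain Python. The field names are made up for illustration; in practice you would push these filters into the vector store query itself rather than filtering after retrieval.

```python
from typing import Any

def metadata_filter(chunks: list[dict[str, Any]], **required: Any) -> list[dict[str, Any]]:
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        c for c in chunks
        if all(c.get("metadata", {}).get(k) == v for k, v in required.items())
    ]

# Hypothetical chunk records for illustration.
chunks = [
    {"text": "Refund policy v2...",
     "metadata": {"product": "billing", "language": "en", "version": "v2"}},
    {"text": "Politique de remboursement...",
     "metadata": {"product": "billing", "language": "fr", "version": "v2"}},
]

candidates = metadata_filter(chunks, product="billing", language="en")
```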
Add reranking
Vector search can retrieve text that is semantically similar to the query but not actually useful for answering it. A reranker rescores the retrieved candidates against the query and keeps only the most relevant ones.
Reranking is especially helpful when:
- documents are long
- queries are ambiguous
- many chunks are similar
- answers require exact policy details
- top-k retrieval returns noisy results
The added reranking cost can be worth it if it reduces prompt size and retries.
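One common approach is a cross-encoder reranker. The sketch below uses the open-source sentence-transformers library; the specific model name is just one popular choice, and a hosted reranking API would slot in the same way.

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    """Rescore retrieved passages against the query and keep the best top_n."""
    # One commonly used cross-encoder checkpoint; swap in whatever
    # reranker fits your stack. Load it once at startup in real code.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [p for p, _ in ranked[:top_n]]
```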
Control top-k
Sending the top 20 chunks to an LLM is rarely efficient. Test smaller values such as 3, 5, or 8.
Measure:
- answer accuracy
- token cost
- latency
- citation quality
- user satisfaction
More context is not always better.
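A simple way to run that comparison is a top-k sweep over a labeled eval set. In the sketch below, `retrieve`, `answer`, and `grade` are hypothetical stand-ins for your own retrieval, generation, and scoring functions, and the eval set is your own labeled queries.

```python
def sweep_top_k(eval_set, retrieve, answer, grade, k_values=(3, 5, 8, 20)):
    """Compare answer accuracy and token cost across top-k settings.

    eval_set: list of (query, expected_answer) pairs.
    retrieve(query, top_k): returns a list of context chunks.
    answer(query, context): returns {"text": ..., "usage": {"total_tokens": ...}}.
    grade(text, expected): returns 1 if the answer is acceptable, else 0.
    """
    results = {}
    for k in k_values:
        correct, tokens = 0, 0
        for query, expected in eval_set:
            context = retrieve(query, top_k=k)
            response = answer(query, context)
            correct += grade(response["text"], expected)
            tokens += response["usage"]["total_tokens"]
        results[k] = {
            "accuracy": correct / len(eval_set),
            "avg_tokens": tokens / len(eval_set),
        }
    return results
```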
Route RAG tasks by complexity
Not every RAG query needs the same model.
Examples:
- simple FAQ lookup: small fast model
- policy interpretation: stronger reasoning model
- long document synthesis: long-context model
- extraction from retrieved context: structured-output model
Routing RAG requests by complexity can reduce spend while preserving quality.
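A router can be as simple as a lookup table plus a cheap classification step. The sketch below uses a deliberately crude keyword heuristic and placeholder model names; production routers typically use a trained classifier or a small LLM call to pick the route.

```python
# Placeholder model names; substitute the models available in your stack.
ROUTES = {
    "faq": "small-fast-model",
    "policy": "strong-reasoning-model",
    "synthesis": "long-context-model",
    "extraction": "structured-output-model",
}

def route(query: str, context_chars: int) -> str:
    """Pick a model route from crude query and context signals."""
    q = query.lower()
    if context_chars > 50_000:
        return ROUTES["synthesis"]
    if any(w in q for w in ("extract", "list all", "fields")):
        return ROUTES["extraction"]
    if any(w in q for w in ("policy", "allowed", "compliance")):
        return ROUTES["policy"]
    return ROUTES["faq"]
```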
Log retrieval data
For each RAG answer, log:
- query
- retrieved document IDs
- chunk IDs
- similarity scores
- reranking scores
- final context length
- model used
- token usage
- answer status
This is essential for debugging bad answers.
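A minimal version is one JSON line per answer. The field names below are illustrative, assuming your retriever returns per-chunk IDs and scores.

```python
import json
import time
import uuid

def log_rag_answer(query, retrieval, model, usage, status, path="rag_log.jsonl"):
    """Append one structured record per RAG answer to a JSONL file.

    retrieval: list of dicts with illustrative keys
    doc_id, chunk_id, score, rerank_score, text.
    """
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "query": query,
        "doc_ids": [r["doc_id"] for r in retrieval],
        "chunk_ids": [r["chunk_id"] for r in retrieval],
        "similarity_scores": [r["score"] for r in retrieval],
        "rerank_scores": [r.get("rerank_score") for r in retrieval],
        "context_chars": sum(len(r["text"]) for r in retrieval),
        "model": model,
        "token_usage": usage,
        "status": status,  # e.g. "ok", "no_answer", "retried"
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```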
Final thoughts
RAG quality is not only a model problem. It is a retrieval problem, a context problem, and a cost problem.
Before upgrading to a more expensive model, improve chunking, metadata filters, reranking, top-k limits, and routing. The result is often cheaper and better.