Retrieval-Augmented Generation (RAG) has become the default pattern for building AI systems that need to answer questions over proprietary data. The idea is straightforward: instead of relying solely on a language model's training data, you retrieve relevant documents at query time and feed them into the model as context.
But "RAG" is not a single architecture. There is an enormous gap between a naive implementation and a production-grade system. Having built RAG pipelines across legal, financial, healthcare, and operational domains, we've learned which patterns work, which ones are over-engineered, and when to use each.
Naive RAG vs Advanced RAG
Naive RAG is the simplest possible implementation: chunk your documents, embed them into a vector database, and at query time, embed the user's question, retrieve the top-k most similar chunks, and pass them to a language model with instructions to answer based on the provided context.
This works surprisingly well for simple use cases — internal FAQ bots, basic document search, and knowledge bases with well-structured content. If your documents are clean, your questions are straightforward, and you don't need high precision, naive RAG can be deployed in days.
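In practice the whole loop fits in a few dozen lines. A minimal sketch, assuming ChromaDB for storage and the OpenAI client for generation (model, prompt, and collection names are illustrative):

```python
# Naive RAG sketch -- chunking is omitted here (see the chunking section below);
# assumes `chromadb` and `openai` are installed and OPENAI_API_KEY is set.
import chromadb
from openai import OpenAI

llm = OpenAI()
collection = chromadb.Client().create_collection("docs")

def index(chunks: list[str]) -> None:
    # Chroma embeds with its default embedding function when none is given.
    collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

def answer(question: str, k: int = 4) -> str:
    hits = collection.query(query_texts=[question], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```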
The problems emerge when requirements get more demanding. Naive RAG struggles with multi-hop questions (where the answer spans multiple documents), ambiguous queries, documents with complex structure (tables, nested sections), and situations where precision matters more than recall.
Advanced RAG addresses these limitations through a series of improvements at each stage of the pipeline:
- Pre-retrieval: Query transformation, expansion, and decomposition. Instead of embedding the raw user query, you rewrite it to improve retrieval quality. For example, breaking "What are the payment terms and liability caps in the Smith contract?" into two separate retrieval queries (a sketch of this decomposition follows the list).
- Retrieval: Hybrid search combining dense (embedding-based) and sparse (keyword-based) retrieval. Re-ranking retrieved results using a cross-encoder model that scores relevance more accurately than cosine similarity alone.
- Post-retrieval: Chunk compression, deduplication, and relevance filtering before passing context to the language model. This reduces noise and keeps the context window focused on genuinely relevant information.
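The pre-retrieval decomposition step, for instance, can be a single extra model call. A minimal sketch, assuming an OpenAI-compatible client (prompt wording and model name are illustrative):

```python
# Query decomposition sketch -- split a compound question into standalone
# retrieval queries, one per line.
from openai import OpenAI

llm = OpenAI()

def decompose(query: str) -> list[str]:
    prompt = (
        "Split the following question into the smallest set of standalone search "
        "queries, one per line. If it is already a single question, return it "
        f"unchanged.\n\nQuestion: {query}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return [line.strip("- ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]

# Each sub-query is then retrieved separately and the results merged before generation.
```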
For most enterprise applications, advanced RAG is where you want to be. The additional complexity is modest compared to the improvement in answer quality.
Chunking Strategies
How you split documents into chunks has a disproportionate impact on retrieval quality. Get this wrong and no amount of model sophistication will save you.
Fixed-size chunking (e.g., 500 tokens with 50-token overlap) is the simplest approach and works reasonably well for homogeneous documents like articles and reports. The overlap ensures you don't lose information at chunk boundaries.
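A minimal sketch of that sliding window, assuming the tiktoken tokenizer (any tokenizer works the same way):

```python
# Fixed-size chunking sketch -- 500-token windows with 50-token overlap.
import tiktoken

def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)
    chunks = []
    for start in range(0, len(tokens), size - overlap):
        chunks.append(encoding.decode(tokens[start:start + size]))
        if start + size >= len(tokens):  # last window already covers the tail
            break
    return chunks
```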
Semantic chunking splits documents at natural boundaries — paragraphs, sections, or topic shifts detected by embedding similarity. This produces more coherent chunks that are easier for the retrieval system to match against queries. We recommend this as the default for most use cases.
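A rough sketch of the embedding-similarity variant, assuming sentence-transformers; the model and threshold are illustrative and worth tuning on your own corpus:

```python
# Semantic chunking sketch -- split on paragraphs, then start a new chunk
# whenever the similarity between neighbouring paragraphs drops below a threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    embeddings = model.encode(paragraphs, convert_to_tensor=True)
    chunks, current = [], [paragraphs[0]]
    for i in range(1, len(paragraphs)):
        if util.cos_sim(embeddings[i - 1], embeddings[i]).item() < threshold:
            chunks.append("\n\n".join(current))  # topic shift: close the chunk
            current = []
        current.append(paragraphs[i])
    chunks.append("\n\n".join(current))
    return chunks
```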
Hierarchical chunking maintains parent-child relationships between chunks. When a small chunk is retrieved, you can also pull in its parent chunk for additional context. This is particularly effective for long documents with clear section structure, such as legal contracts or technical manuals.
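One way to sketch the parent-child lookup, assuming each child chunk stores a parent_id in its metadata (collection names are illustrative):

```python
# Small-to-big retrieval sketch -- match against fine-grained child chunks,
# then expand each hit to its parent section before building the context.
import chromadb

db = chromadb.Client()
children = db.create_collection("child_chunks")    # small chunks, embedded for retrieval
parents = db.create_collection("parent_sections")  # full sections, fetched by id

def retrieve_with_parents(query: str, k: int = 5) -> list[str]:
    hits = children.query(query_texts=[query], n_results=k)
    parent_ids = [meta["parent_id"] for meta in hits["metadatas"][0]]
    unique_ids = list(dict.fromkeys(parent_ids))   # dedupe, keep retrieval order
    return parents.get(ids=unique_ids)["documents"]
```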
Structured chunking handles documents with tables, forms, or other structured content. Tables need special handling — chunking a table row by row destroys meaning. We typically extract tables separately, summarise them, and store both the summary (for retrieval) and the raw data (for precise answers).
A practical recommendation: start with semantic chunking at 400-800 tokens, measure retrieval precision, and only add complexity if you identify specific failure modes.
Embedding Models and Vector Databases
The embedding model converts text into numerical vectors that capture semantic meaning. Your choice here affects both quality and cost.
For English-language enterprise use cases, we currently recommend starting with OpenAI's text-embedding-3-large or Cohere's embed-v3; both deliver excellent retrieval performance, and both offer strong multilingual support should your corpus grow beyond English. For on-premise deployments where data cannot leave your infrastructure, BGE-M3 and E5-Mistral are strong open-source alternatives.
For vector databases, the market has matured considerably. Pinecone offers the smoothest managed experience. Qdrant provides the best balance of performance and flexibility for self-hosted deployments. ChromaDB is excellent for prototyping and smaller-scale applications. And pgvector lets you add vector search to your existing PostgreSQL database, which is ideal when you want to avoid adding another infrastructure dependency.
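As an illustration of that last option, a minimal pgvector sketch using psycopg; the table name, vector dimension, and connection handling are assumptions:

```python
# pgvector sketch -- vector search inside the Postgres you already run.
# Embeddings come from whichever embedding model you chose above.
import psycopg

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    content text,
    embedding vector(1536)
);
"""

def top_k(conn: psycopg.Connection, query_embedding: list[float], k: int = 5) -> list[str]:
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    rows = conn.execute(
        # <=> is pgvector's cosine-distance operator; smaller means more similar
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (vec, k),
    ).fetchall()
    return [content for (content,) in rows]
```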
For a detailed comparison, see our vector database comparison. For most projects, we recommend choosing based on your existing infrastructure and operational capabilities rather than raw benchmark performance.
Re-ranking and Hybrid Search
The single most impactful improvement you can make to a naive RAG system is adding a re-ranker. Here is why.
Embedding-based retrieval (dense retrieval) is fast but imprecise. It finds documents that are semantically similar to the query, but "semantically similar" is not always the same as "contains the answer." A cross-encoder re-ranker takes each retrieved document and the query as a pair, producing a much more accurate relevance score. The cost is higher latency (re-ranking 20-50 documents typically adds 200-500ms), but the quality improvement is substantial.
Cohere Rerank and open-source models like bge-reranker-v2 are our go-to choices. For latency-sensitive applications, you can retrieve a larger initial set (say, 50 documents) and re-rank to the top 5, getting both breadth and precision.
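A minimal sketch of that retrieve-then-rerank pattern, assuming sentence-transformers can load your chosen re-ranker (the model name here is one option, not a prescription):

```python
# Retrieve-then-rerank sketch -- pull a broad candidate set from the vector
# store, score each (query, document) pair with a cross-encoder, keep the best few.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # check the model card for loading options

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Typical pattern: retrieve ~50 candidates, then rerank(query, candidates, top_n=5).
```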
Hybrid search combines dense retrieval with sparse (keyword-based) retrieval using BM25 or similar algorithms. Dense retrieval excels at semantic matching but can miss exact keyword matches. Sparse retrieval catches those but misses semantic relationships. Combining both with reciprocal rank fusion gives you the best of both approaches.
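The fusion step itself is only a few lines. A sketch over the ranked ID lists returned by each retriever, using the k = 60 constant most implementations default to:

```python
# Reciprocal rank fusion sketch -- merge dense and sparse result lists by rank
# alone, so no score calibration between the two retrievers is needed.
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused_ids = reciprocal_rank_fusion([dense_result_ids, bm25_result_ids])
```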
This matters most when your documents contain domain-specific terminology, product codes, legal references, or other tokens where exact matching is essential.
Evaluation: How to Know Your RAG System Works
The most overlooked aspect of RAG development is systematic evaluation. Without it, you're flying blind — making changes and hoping they help.
We evaluate RAG systems at two levels:
- Retrieval quality: Are the right documents being retrieved? Measure this with precision@k, recall@k, and mean reciprocal rank (MRR); a small scoring sketch follows this list. Build a test set of 50-100 question-document pairs and run your retrieval pipeline against them after every change.
- Answer quality: Given the retrieved documents, is the model generating correct, complete, and well-grounded answers? Use a combination of automated metrics (faithfulness, relevance, completeness) and human evaluation. Tools like RAGAS and custom LLM-as-judge pipelines make automated evaluation practical.
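The retrieval-level metrics need no framework at all. A minimal sketch, assuming each test question comes with the set of document IDs a correct answer depends on:

```python
# Retrieval evaluation sketch -- precision@k, recall@k, and MRR over a test set
# of (retrieved_ids, relevant_ids) pairs. The data layout is an assumption.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def mean_reciprocal_rank(test_set: list[tuple[list[str], set[str]]]) -> float:
    total = 0.0
    for retrieved, relevant in test_set:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(test_set)
```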
Start building your evaluation set from day one. Every time a user asks a question and you can verify the correct answer, add it to the set. This compounds over time and becomes your most valuable asset for iterating on the system.
For a deeper dive into building knowledge bases with RAG, see our knowledge base services.
RAG is powerful, but the gap between a prototype and a production system is significant. If you're building a RAG pipeline and want to get it right first time, book a free architecture review and we'll help you choose the right pattern for your requirements.