RAG Pipeline Design Template
A technical design template for building a retrieval-augmented generation (RAG) pipeline. Covers the complete architecture from data ingestion through retrieval to response generation, including chunking strategies, embedding models, vector stores, and evaluation.
RAG Pipeline Architecture Overview
- Project name:
- Author:
- Date:
- Status: Draft / In Review / Approved
System Purpose
What question or task will the RAG system answer?
Architecture Components
```
[Data Sources] → [Ingestion] → [Chunking] → [Embedding] → [Vector Store]
                                                                ↓
[User Query] → [Query Processing] → [Retrieval] → [Reranking] → [Context Assembly] → [LLM Generation] → [Response]
```
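The query path in the diagram can be sketched as a single function with injected components. The interfaces here (`embed`, `store`, `rerank`, `llm`) are illustrative placeholders, not a specific library:

```python
def answer(query, *, embed, store, rerank, llm, top_k=10):
    """End-to-end query path matching the diagram above.
    embed, store, rerank and llm are injected dependencies
    (hypothetical interfaces, swapped per your technology choices)."""
    qvec = embed(query)                       # query processing + embedding
    candidates = store.query(qvec, top_k=top_k)  # retrieval
    ordered = rerank(query, candidates)       # reranking
    context = "\n\n".join(chunk for chunk, _score in ordered[:5])  # context assembly
    return llm(context=context, question=query)  # generation
```

Keeping each stage behind a small interface makes it easy to swap components (e.g. a different reranker) without touching the rest of the pipeline.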
Component Summary
| Component | Technology Choice | Rationale |
|---|---|---|
| Data sources | | |
| Document parser | | |
| Chunking method | | |
| Embedding model | | |
| Vector store | | |
| Retriever | | |
| Reranker | | |
| LLM (generation) | | |
| Orchestration | | |
Performance Requirements
- Query latency target: ms (p95)
- Throughput: queries/second
- Accuracy target: % (measured by )
- Freshness: Data updated every
Data Ingestion & Chunking
Data Sources
| Source | Format | Size | Update Frequency | Access Method |
|---|---|---|---|---|
| | PDF / HTML / JSON / CSV | docs | Daily / Weekly / Static | API / S3 / DB |
| | | docs | | |
| | | docs | | |
Document Processing Pipeline
1. Extraction: Convert source documents to plain text
   - PDF: Use (e.g. PyMuPDF, Unstructured)
   - HTML: Use (e.g. BeautifulSoup, Trafilatura)
   - Tables: Use (e.g. Camelot, Tabula)
2. Cleaning: Remove noise and normalise text
   - Strip headers, footers, page numbers
   - Normalise whitespace and encoding
   - Remove duplicate content
   - Preserve document structure (headings, lists)
3. Metadata extraction:
   - Document title, author, date
   - Section headings and hierarchy
   - Source URL or file path
   - Custom metadata:
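The cleaning stage can be sketched with only the standard library (real pipelines usually also lean on the parser's own structure-preserving output):

```python
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Normalise extracted text: fix encoding quirks and collapse
    whitespace while preserving paragraph breaks."""
    text = unicodedata.normalize("NFKC", raw)   # e.g. non-breaking space -> space
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # cap blank lines at one
    return text.strip()
```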
Chunking Strategy
Selected approach: Fixed-size / Semantic / Recursive / Document-aware
| Parameter | Value | Rationale |
|---|---|---|
| Chunk size | tokens | |
| Chunk overlap | tokens | |
| Splitting method | | |
| Metadata per chunk | | |
Chunking Considerations
- Tables: Kept whole / Split by row /
- Code blocks: Kept whole / Split by function /
- Lists: Kept with parent heading / Split individually /
- Images: OCR extracted / Alt-text only / Excluded /
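The fixed-size-with-overlap approach can be sketched as follows. Whitespace-split words stand in for a real tokenizer (e.g. tiktoken); the defaults mirror the 512/50 starting point suggested in the FAQ below:

```python
def chunk_text(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Fixed-size chunking with overlap. Each chunk shares its last
    `overlap` tokens with the start of the next chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, step = [], size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk reached the end; avoid a tiny tail-only chunk
    return chunks
```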
Embedding & Retrieval
Embedding Model Selection
| Model | Dimensions | Max Tokens | Cost | Benchmark Score |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 8191 | $0.13/M tokens | |
| OpenAI text-embedding-3-small | 1536 | 8191 | $0.02/M tokens | |
| Cohere embed-v3 | 1024 | 512 | $0.10/M tokens | |
| BGE-large-en-v1.5 (open source) | 1024 | 512 | Self-hosted | |
| Voyage-3 | 1024 | 32000 | $0.06/M tokens | |
Selected model:
Rationale:
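When benchmarking candidate models on your evaluation set, the core measurements are vector similarity and retrieval recall. A minimal sketch of both, using only the standard library:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of labelled-relevant doc IDs found in the top-k results."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0
```

Running `recall_at_k` over the same labelled queries for each candidate model gives a like-for-like comparison before committing.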
Vector Store Configuration
Selected store: Pinecone / Weaviate / Qdrant / pgvector / ChromaDB /
| Setting | Value |
|---|---|
| Index type | HNSW / IVF / Flat / |
| Distance metric | Cosine / Euclidean / Dot product |
| Dimensions | |
| Replicas | |
| Pods/Shards | |
| Metadata filtering | Enabled: fields |
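The upsert/query contract that most stores (Pinecone, Qdrant, pgvector) expose, including metadata filtering, can be illustrated with a toy flat index. This is cosine similarity over a dict, with no ANN index, purely for illustration:

```python
import math

class InMemoryVectorStore:
    """Toy flat-index vector store with cosine scoring and exact-match
    metadata filtering. Not a substitute for a real HNSW/IVF index."""

    def __init__(self):
        self._rows = {}  # id -> (vector, metadata)

    def upsert(self, doc_id, vector, metadata=None):
        self._rows[doc_id] = (vector, metadata or {})

    def query(self, vector, top_k=5, filter=None):
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.sqrt(sum(x * x for x in a)) *
                          math.sqrt(sum(x * x for x in b)))
        candidates = [
            (doc_id, cos(vector, vec), meta)
            for doc_id, (vec, meta) in self._rows.items()
            if filter is None or all(meta.get(k) == v for k, v in filter.items())
        ]
        return sorted(candidates, key=lambda r: r[1], reverse=True)[:top_k]
```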
Retrieval Strategy
| Parameter | Value | Notes |
|---|---|---|
| Top-k documents | | Start with 5-10, tune based on accuracy |
| Similarity threshold | | Filter out low-relevance chunks |
| Hybrid search | Yes / No | Combines semantic + keyword search |
| Keyword weight (if hybrid) | % | BM25 or similar |
| Reranking model | | Cross-encoder or LLM-based |
| Post-retrieval filtering | | Metadata, recency, deduplication |
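If hybrid search is enabled, semantic and keyword scores live on different scales and must be normalised before mixing. A sketch of weighted score fusion (BM25 scores assumed precomputed elsewhere):

```python
def hybrid_scores(semantic: dict, keyword: dict, keyword_weight: float = 0.3) -> dict:
    """Weighted fusion of semantic and keyword (e.g. BM25) scores.
    Both inputs map doc_id -> raw score; each set is min-max
    normalised first so the weight is meaningful."""
    def normalise(scores):
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo or 1.0
        return {d: (s - lo) / span for d, s in scores.items()}
    sem, kw = normalise(semantic), normalise(keyword)
    docs = set(sem) | set(kw)
    w = keyword_weight
    return {d: (1 - w) * sem.get(d, 0.0) + w * kw.get(d, 0.0) for d in docs}
```

Reciprocal rank fusion is a common alternative that sidesteps score normalisation entirely.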
Generation & Evaluation
Prompt Template
```
You are a helpful assistant for [ORGANISATION/DOMAIN].
Answer the user's question using ONLY the information provided in the context below.
If the context does not contain enough information to answer, say so clearly.
Do not make up information.

## Context
{retrieved_chunks}

## Question
{user_query}

## Answer
```
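Context assembly around this template can be sketched as follows. The character budget is a stand-in for proper token counting, and numbering the chunks gives the model stable handles for inline citations:

```python
PROMPT = """You are a helpful assistant for [ORGANISATION/DOMAIN].
Answer the user's question using ONLY the information provided in the context below.
If the context does not contain enough information to answer, say so clearly.
Do not make up information.

## Context
{retrieved_chunks}

## Question
{user_query}

## Answer
"""

def build_prompt(chunks: list[str], query: str, max_chars: int = 8000) -> str:
    """Assemble the prompt, numbering chunks for citation and stopping
    at a rough character budget (a real pipeline would count tokens)."""
    context, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # drop lowest-ranked chunks first; list is already ordered
        context.append(entry)
        used += len(entry)
    return PROMPT.format(retrieved_chunks="\n\n".join(context), user_query=query)
```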
Generation Parameters
| Parameter | Value | Rationale |
|---|---|---|
| LLM model | | |
| Temperature | (0.0-1.0) | Lower = more deterministic |
| Max tokens | | |
| System prompt | See template above | |
| Citation format | Inline / Footnotes / None | |
Evaluation Framework
| Metric | Measurement Method | Target | Current |
|---|---|---|---|
| Answer relevance | LLM-as-judge (0-5 scale) | > 4.0 | |
| Faithfulness | Manual review + LLM check | > 95% | |
| Context precision | Relevant chunks / Total chunks | > 70% | |
| Context recall | Relevant retrieved / Total relevant | > 80% | |
| Latency (p50) | Instrumentation | < ms | ms |
| Latency (p95) | Instrumentation | < ms | ms |
| Hallucination rate | Manual review of 100 samples | < 5% | % |
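Context precision and recall from the table above reduce to simple set arithmetic over labelled chunk IDs:

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Relevant chunks among retrieved / total chunks retrieved."""
    if not retrieved:
        return 0.0
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Relevant chunks retrieved / total relevant chunks."""
    if not relevant:
        return 0.0
    return sum(1 for c in relevant if c in set(retrieved)) / len(relevant)
```

Averaging these per-query values over the evaluation dataset gives the numbers to track against the > 70% and > 80% targets.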
Evaluation Dataset
- Create + question-answer pairs covering key topics
- Include edge cases and adversarial queries
- Label ground-truth answers for each question
- Label relevant source documents for each question
- Run evaluation before every major pipeline change
How to use this template
1. Define the use case and data sources. Start by clearly defining what the RAG system will answer and which documents form the knowledge base. This drives every downstream decision.
2. Design the chunking strategy. Experiment with different chunk sizes (256-1024 tokens) and overlap settings. The right strategy depends on your content type.
3. Select and benchmark embedding models. Test 2-3 embedding models on your actual data. Use your evaluation dataset to measure retrieval quality before committing.
4. Build and tune the retrieval pipeline. Start simple (semantic search with top-5) and add complexity (hybrid search, reranking) only when evaluation shows it improves results.
5. Iterate on generation prompts. Write evaluation-driven prompt templates. Test against your evaluation dataset and refine based on faithfulness and relevance scores.
6. Establish continuous evaluation. Set up automated evaluation to run on every pipeline change. Monitor production metrics for regression.
Common mistakes to avoid
Frequently asked questions
What chunk size should I use?
There is no universal answer. Start with 512 tokens and a 50-token overlap for general documents. For technical documentation, try 256-384 tokens for more precise retrieval. For conversational content, try 768-1024 tokens. Always benchmark different sizes on your data.
Should I use a managed vector store or self-host?
Managed services (Pinecone, Weaviate Cloud) are best for getting started quickly and for production workloads where you want managed scaling. Self-hosting (Qdrant, pgvector) is better when you need data residency control, have cost constraints at scale, or already have infrastructure expertise.
How do I keep the index fresh when source documents change?
Implement an incremental update pipeline: detect changed documents, re-chunk and re-embed only those documents, and upsert into the vector store. Use document IDs and version tracking to manage this efficiently.
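A sketch of the change-detection step, using content hashes as the version marker (the ID scheme and storage layout here are assumptions):

```python
import hashlib

def plan_updates(current_docs: dict, indexed_versions: dict):
    """Decide what to re-embed by comparing content hashes against
    what is already indexed. current_docs maps doc_id -> text;
    indexed_versions maps doc_id -> stored content hash."""
    to_upsert, to_delete = {}, []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if indexed_versions.get(doc_id) != digest:
            to_upsert[doc_id] = digest  # new or changed: re-chunk, re-embed, upsert
    for doc_id in indexed_versions:
        if doc_id not in current_docs:
            to_delete.append(doc_id)    # removed at source: purge from the index
    return to_upsert, to_delete
```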
When is a reranker worth adding?
Use a reranker when retrieval precision matters more than latency. Cross-encoder rerankers (like Cohere Rerank or a fine-tuned model) significantly improve the relevance of retrieved chunks but add 50-200ms of latency. They are especially valuable when your initial retrieval returns many marginally relevant results.
How do I reduce hallucinations?
Key strategies: use explicit instructions in the prompt to answer only from context, lower the temperature, implement faithfulness checking (compare answer claims against source chunks), add citations to enable verification, and set up human-in-the-loop review for high-stakes responses.