This article describes the canonical RAG architecture, the variations that have emerged for specific challenges, the operational considerations that determine quality and cost, and the governance considerations that make RAG defensible in regulated environments.
The Canonical RAG Pipeline
A baseline RAG pipeline has six stages.
1. Document Ingestion
Source documents (knowledge base articles, policies, product information, conversation logs) are ingested into the system. Ingestion includes parsing (handling PDF, HTML, Office formats), cleaning (removing boilerplate), and metadata extraction.
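As a rough illustration of what ingestion hands to the later stages, the sketch below shows one possible record shape. The field names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class IngestedDocument:
    """One parsed source document with the metadata later RAG stages rely on."""
    doc_id: str            # stable identifier, used for change detection and citation
    source_uri: str        # where the document came from
    text: str              # cleaned body text with boilerplate removed
    content_type: str      # e.g. "pdf", "html", "docx"
    last_modified: str     # ISO timestamp, useful for freshness indicators
    metadata: dict = field(default_factory=dict)  # author, department, access tags, etc.
```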
2. Chunking
Documents are split into chunks of manageable size for embedding and retrieval. Chunk size, chunk overlap, and chunking strategy (fixed-size, sentence-aware, semantic) all affect retrieval quality. The LangChain documentation at https://python.langchain.com/docs/concepts/text_splitters/ catalogues the common strategies.
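A minimal sketch of the simplest strategy, fixed-size chunking with overlap, is shown below; the `chunk_size` and `overlap` values are illustrative assumptions, and production systems usually prefer the sentence-aware or semantic splitters catalogued in the LangChain documentation.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking by character count, with overlap between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```

The overlap preserves context that would otherwise be cut at chunk boundaries, at the cost of some duplicated storage and embedding spend.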
3. Embedding Generation
Each chunk is converted to a vector (embedding) using an embedding model. Embedding model choice has significant downstream effects: dimensionality affects storage and retrieval cost; quality affects retrieval relevance.
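The sketch below shows the shape of this stage; `embedding_client.embed` is a hypothetical stand-in for whichever embedding API or local model the platform uses.

```python
def embed_chunks(chunks: list[str], embedding_client) -> list[list[float]]:
    """Embed each chunk; the vector dimensionality is fixed by the chosen model."""
    vectors = [embedding_client.embed(chunk) for chunk in chunks]  # hypothetical client
    dims = {len(v) for v in vectors}
    assert len(dims) <= 1, "all vectors from one model should share a dimensionality"
    return vectors
```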
4. Storage and Indexing
Embeddings are stored in a vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector) with appropriate indexing (HNSW, IVF, flat) for the scale and query pattern.
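To make the indexing choices concrete, the sketch below implements the simplest option, a brute-force "flat" index with exact cosine search. It is a teaching sketch only; at scale a vector database with HNSW or IVF indexing takes its place.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FlatIndex:
    """Minimal flat vector index: exact search, no approximate-nearest-neighbour structure."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def search(self, query_vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        scored = [(chunk_id, _cosine(query_vector, vec)) for chunk_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```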
5. Retrieval
At query time, the user query is embedded and used to retrieve the most similar chunks. Hybrid retrieval combining vector similarity with keyword matching (BM25) usually outperforms pure vector retrieval. Re-ranking models can refine the top results.
6. Generation
The retrieved chunks are passed to the LLM along with the user query and a system prompt that instructs the model to ground its response in the provided context. The model generates the response, ideally with citations to specific source chunks.
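A minimal sketch of the prompt-assembly step follows; the instruction wording and the chunk dictionary shape (`chunk_id`, `text`) are illustrative assumptions rather than a fixed template.

```python
def build_grounded_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble a prompt that instructs the model to answer only from the supplied
    chunks and to cite them by id."""
    context = "\n\n".join(
        f"[{chunk['chunk_id']}] {chunk['text']}" for chunk in retrieved
    )
    return (
        "Answer the user's question using only the context below. "
        "Cite the chunk ids you relied on in square brackets. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```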
Architectural Variations
Several variations address specific challenges.
Hybrid Search
Pure vector retrieval can miss exact keyword matches that the user clearly intended. Pure keyword retrieval misses semantic similarity. Hybrid search combining the two through reciprocal rank fusion or weighted combinations consistently outperforms either alone. Microsoft's Azure AI Search documentation on hybrid search at https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview describes the pattern in detail.
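Reciprocal rank fusion itself is simple enough to sketch directly; each result's score is the sum of 1/(k + rank) across the input rankings, with k = 60 a commonly used constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids (e.g. BM25 results and vector results)
    into one list ordered by combined reciprocal-rank score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A typical call fuses the keyword and vector result lists: `fused = reciprocal_rank_fusion([bm25_ids, vector_ids])`.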
Re-Ranking
After initial retrieval, a re-ranking model (often a smaller LLM or a cross-encoder) scores each candidate against the query for relevance. Re-ranking improves the top-k quality at the cost of additional latency. The Cohere Rerank API and similar services have made this pattern broadly accessible.
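The sketch below shows where re-ranking sits in the pipeline; `cross_encoder.score` is a hypothetical stand-in for a cross-encoder model or a managed service such as Cohere Rerank.

```python
def rerank(query: str, candidates: list[dict], cross_encoder, top_k: int = 5) -> list[dict]:
    """Score each retrieved candidate against the query and keep the best top_k."""
    scored = [(cross_encoder.score(query, c["text"]), c) for c in candidates]  # hypothetical scorer
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_k]]
```

The design trade-off is explicit here: a wider initial retrieval (say 50 candidates) followed by re-ranking to 5 usually beats retrieving 5 directly, but adds one model call to the latency budget.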
Multi-Vector and Hierarchical Retrieval
Storing multiple embeddings per document (different chunk granularities, different aspects) enables more sophisticated retrieval. Hierarchical patterns retrieve broad documents first, then specific chunks within them.
Query Transformation
Transforming the user query before retrieval — by expansion, decomposition, or rewriting — can improve retrieval quality. A vague query “tell me about returns” might be expanded to “company return policy, process for returning items, return shipping options.”
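A minimal query-expansion sketch follows; `llm.complete` is a hypothetical stand-in for the platform's LLM client, and the prompt wording is illustrative.

```python
EXPANSION_PROMPT = (
    "Rewrite the user query as three short search queries that cover the likely "
    "intents. Return one query per line.\n\nUser query: {query}"
)


def expand_query(query: str, llm) -> list[str]:
    """Turn a vague query into several retrieval-friendly variants."""
    response = llm.complete(EXPANSION_PROMPT.format(query=query))  # hypothetical LLM client
    return [line.strip() for line in response.splitlines() if line.strip()]
```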
Self-Querying and Filtering
The LLM generates structured query metadata (filters, date ranges, document types) from the natural language query, enabling precise filtering before semantic retrieval.
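The sketch below illustrates the self-querying step under the assumption of a small, fixed filter schema; both the schema and `llm.complete` are illustrative stand-ins.

```python
import json

FILTER_PROMPT = (
    "Extract retrieval filters from the user query as JSON with keys "
    "'document_type', 'date_from', and 'date_to' (use null when absent).\n\n"
    "Query: {query}"
)


def extract_filters(query: str, llm) -> dict:
    """Derive structured metadata filters to apply before semantic retrieval."""
    return json.loads(llm.complete(FILTER_PROMPT.format(query=query)))  # hypothetical LLM client
```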
Knowledge Graph Augmentation
Combining vector retrieval with knowledge graph traversal supports use cases where relationships between entities matter (compliance dependencies, organisational reporting structures, product relationships). Microsoft’s GraphRAG project at https://github.com/microsoft/graphrag illustrates the pattern.
Agentic RAG
The retrieval step itself becomes a tool the LLM agent can invoke multiple times during a single response, asking different questions of the knowledge base as the response develops.
Operational Considerations
Embedding Refresh
When source documents change, their embeddings must be regenerated. Embedding refresh pipelines, change detection, and incremental updating are operational disciplines that distinguish production-grade RAG from prototype RAG.
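One common change-detection approach is a content hash per document, as sketched below; the mapping shapes are illustrative assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_reembed(current: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents whose content changed since the last embedding run.
    `current` maps doc_id -> cleaned text; `stored_hashes` maps doc_id -> previous hash."""
    return [
        doc_id for doc_id, text in current.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
```

Only the returned documents need re-chunking and re-embedding, which keeps refresh cost proportional to change volume rather than corpus size.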
Embedding Model Migration
Switching embedding models requires re-embedding the entire corpus (per the vendor lock-in discussion in Module 1.24). The migration must be planned, with parallel operation during cutover.
Chunking Strategy Tuning
Initial chunking strategies often need adjustment based on observed retrieval quality. Continuous improvement through quality measurement is essential.
Cost Management
RAG operations incur costs across embedding generation (priced per token), vector storage, and LLM generation (priced per token). Per-decision cost tracking (per Module 1.24) reveals expensive use cases that need optimisation.
Latency Engineering
End-to-end latency from user query to response includes retrieval, re-ranking, generation, and any post-processing. Latency budgets should be allocated per stage with monitoring.
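A small instrumentation sketch is shown below; the stage names in the usage comment are illustrative, and the point is simply that each stage's contribution to the budget gets measured.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record wall-clock time for one pipeline stage so latency budgets can be checked."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Usage sketch (stage budgets and function names are illustrative):
# timings: dict[str, float] = {}
# with timed("retrieval", timings): chunks = retrieve(query)
# with timed("generation", timings): answer = generate(query, chunks)
```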
Caching
Identical or near-identical queries can hit caches at multiple layers (semantic cache, embedding cache, generation cache). Caching can materially reduce cost and latency for high-traffic patterns.
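A semantic cache can be sketched as a similarity check against previously answered queries; the 0.95 threshold below is an illustrative assumption that needs tuning, and `embed_fn` is a hypothetical embedding callable.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a previous answer when a new query is close enough to a cached one."""

    def __init__(self, embed_fn, threshold: float = 0.95) -> None:
        self._embed = embed_fn          # hypothetical embedding callable
        self._entries: list[tuple[list[float], str]] = []
        self._threshold = threshold

    def get(self, query: str) -> str | None:
        query_vec = self._embed(query)
        for vec, answer in self._entries:
            if _cosine(query_vec, vec) >= self._threshold:
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))
```

A cache hit skips retrieval and generation entirely, which is why even modest hit rates can change the cost profile of high-traffic query patterns.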
Governance Considerations
Source Attribution
Generated responses should cite the specific source chunks that grounded them. Attribution enables verification by the user and supports audit trails (per Module 1.21).
Source Authority
The retrieval corpus should be a curated source of authoritative information, not a general document dump. Including outdated, contradictory, or unauthoritative sources contaminates outputs.
Document Access Control
RAG must respect document-level access controls. Users should not see content from documents they are not authorised to access. The pattern requires propagating user context through the retrieval pipeline and filtering retrieval results accordingly.
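The simplest form of this filtering is a post-retrieval check, as sketched below under the assumption that each chunk carries an `allowed_groups` metadata field (an illustrative schema).

```python
def filter_by_access(retrieved: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop retrieved chunks the requesting user is not authorised to see."""
    return [
        chunk for chunk in retrieved
        if user_groups & set(chunk.get("allowed_groups", []))
    ]
```

Where the vector database supports it, pushing the access filter into the retrieval query itself is usually preferable, so the top-k results are not depleted by chunks that are later removed.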
Sensitive Data Handling
RAG systems often have access to sensitive data through the retrieval corpus. Generating responses that incorporate sensitive data may create new exposure paths (the LLM might surface details in unexpected ways). Sensitive-data redaction in retrieval results is sometimes appropriate.
Hallucination Despite Grounding
Even with retrieval grounding, LLMs can hallucinate — producing claims not supported by retrieved context, particularly when retrieval is weak. Faithfulness evaluation (such as the Ragas framework) measures the proportion of generated claims supported by retrieved context.
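The faithfulness metric itself is a proportion, sketched below; `extract_claims` and `is_supported` are hypothetical stand-ins for the claim-splitting and support-checking steps a framework like Ragas performs, often with an LLM judge.

```python
def faithfulness(answer: str, context: str, extract_claims, is_supported) -> float:
    """Proportion of generated claims that are supported by the retrieved context."""
    claims = extract_claims(answer)            # hypothetical claim extractor
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))  # hypothetical check
    return supported / len(claims)
```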
Evaluation and Monitoring
RAG systems should be evaluated on multiple dimensions: retrieval recall, retrieval precision, generation faithfulness, response quality, and end-to-end task success. The Stanford HELM framework at https://crfm.stanford.edu/helm/ and EleutherAI's LM Evaluation Harness provide reference patterns for systematic evaluation; general benchmarks such as TruthfulQA probe factual accuracy, while RAG-specific frameworks such as Ragas measure the distinctive failure modes (retrieval relevance, faithfulness).
Common Failure Modes
The first is retrieval failure invisible to the user — the system retrieves irrelevant content but generates a confident response anyway. Counter with retrieval quality monitoring and confidence indicators in responses.
The second is cascading hallucination — the LLM elaborates beyond what retrieval supports, with each generated sentence less supported than the last. Counter with rigorous prompt engineering and faithfulness evaluation.
The third is stale corpus — the knowledge base is not refreshed and the system confidently cites outdated information. Counter with refresh cadence appropriate to information volatility and freshness indicators in responses.
The fourth is cost surprise — embedding generation, storage, and LLM inference costs accumulate faster than budgeted. Counter with proactive cost monitoring and per-use-case cost ceilings.
The fifth is evaluation by anecdote — quality assessed by trying a few queries rather than systematic evaluation. Counter with structured evaluation sets and continuous monitoring.
Looking Forward
The next article in Module 2.21 turns to multi-modal AI systems — Generative AI that handles images, audio, and video alongside text — which shares many architectural patterns with RAG and adds modality-specific considerations.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.