This article describes the canonical RAG architecture, the variations that have emerged for specific challenges, the operational considerations that determine quality and cost, and the governance considerations that make RAG defensible in regulated environments.
The Canonical RAG Pipeline
A baseline RAG pipeline has six stages.
1. Document Ingestion
Source documents (knowledge base articles, policies, product information, conversation logs) are ingested into the system. Ingestion includes parsing (handling PDF, HTML, Office formats), cleaning (removing boilerplate), and metadata extraction.
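As a rough illustration of what ingestion hands to the later stages, the sketch below shows one possible record shape. The field names are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field


@dataclass
class IngestedDocument:
    """One parsed source document with the metadata later RAG stages rely on."""
    doc_id: str            # stable identifier, used for change detection and citation
    source_uri: str        # where the document came from
    text: str              # cleaned body text with boilerplate removed
    content_type: str      # e.g. "pdf", "html", "docx"
    last_modified: str     # ISO timestamp, useful for freshness indicators
    metadata: dict = field(default_factory=dict)  # author, department, access tags, etc.
```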
2. Chunking
Documents are split into chunks of manageable size for embedding and retrieval. Chunk size, chunk overlap, and chunking strategy (fixed-size, sentence-aware, semantic) all affect retrieval quality. The LangChain documentation at https://python.langchain.com/docs/concepts/text_splitters/ catalogues the common strategies.
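A minimal sketch of the simplest strategy, fixed-size chunking with overlap, is shown below; the `chunk_size` and `overlap` values are illustrative assumptions, and production systems usually prefer the sentence-aware or semantic splitters catalogued in the LangChain documentation.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size chunking by character count, with overlap between chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks
```

The overlap preserves context that would otherwise be cut at chunk boundaries, at the cost of some duplicated storage and embedding spend.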
3. Embedding Generation
Each chunk is converted to a vector (embedding) using an embedding model. Embedding model choice has significant downstream effects: dimensionality affects storage and retrieval cost; quality affects retrieval relevance.
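The sketch below shows the shape of this stage; `embedding_client.embed` is a hypothetical stand-in for whichever embedding API or local model the platform uses.

```python
def embed_chunks(chunks: list[str], embedding_client) -> list[list[float]]:
    """Embed each chunk; the vector dimensionality is fixed by the chosen model."""
    vectors = [embedding_client.embed(chunk) for chunk in chunks]  # hypothetical client
    dims = {len(v) for v in vectors}
    assert len(dims) <= 1, "all vectors from one model should share a dimensionality"
    return vectors
```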
4. Storage and Indexing
Embeddings are stored in a vector database (Pinecone, Weaviate, Qdrant, Milvus, pgvector) with appropriate indexing (HNSW, IVF, flat) for the scale and query pattern.
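To make the indexing choices concrete, the sketch below implements the simplest option, a brute-force "flat" index with exact cosine search. It is a teaching sketch only; at scale a vector database with HNSW or IVF indexing takes its place.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class FlatIndex:
    """Minimal flat vector index: exact search, no approximate-nearest-neighbour structure."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, chunk_id: str, vector: list[float]) -> None:
        self._items.append((chunk_id, vector))

    def search(self, query_vector: list[float], top_k: int = 5) -> list[tuple[str, float]]:
        scored = [(chunk_id, _cosine(query_vector, vec)) for chunk_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]
```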
5. Retrieval
At query time, the user query is embedded and used to retrieve the most similar chunks. Hybrid retrieval combining vector similarity with keyword matching (BM25) usually outperforms pure vector retrieval. Re-ranking models can refine the top results.
6. Generation
The retrieved chunks are passed to the LLM along with the user query and a system prompt that instructs the model to ground its response in the provided context. The model generates the response, ideally with citations to specific source chunks.
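A minimal sketch of the prompt-assembly step follows; the instruction wording and the chunk dictionary shape (`chunk_id`, `text`) are illustrative assumptions rather than a fixed template.

```python
def build_grounded_prompt(query: str, retrieved: list[dict]) -> str:
    """Assemble a prompt that instructs the model to answer only from the supplied
    chunks and to cite them by id."""
    context = "\n\n".join(
        f"[{chunk['chunk_id']}] {chunk['text']}" for chunk in retrieved
    )
    return (
        "Answer the user's question using only the context below. "
        "Cite the chunk ids you relied on in square brackets. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```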
Architectural Variations
Several variations address specific challenges.
Hybrid Search
Pure vector retrieval can miss exact keyword matches that the user clearly intended. Pure keyword retrieval misses semantic similarity. Hybrid search combining the two through reciprocal rank fusion or weighted combinations consistently outperforms either alone. Microsoft's Azure AI Search documentation on hybrid search at https://learn.microsoft.com/en-us/azure/search/hybrid-search-overview describes the pattern in detail.
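Reciprocal rank fusion itself is simple enough to sketch directly; each result's score is the sum of 1/(k + rank) across the input rankings, with k = 60 a commonly used constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk ids (e.g. BM25 results and vector results)
    into one list ordered by combined reciprocal-rank score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A typical call fuses the keyword and vector result lists: `fused = reciprocal_rank_fusion([bm25_ids, vector_ids])`.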
Re-Ranking
After initial retrieval, a re-ranking model (often a smaller LLM or a cross-encoder) scores each candidate against the query for relevance. Re-ranking improves the top-k quality at the cost of additional latency. The Cohere Rerank API and similar services have made this pattern broadly accessible.
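The sketch below shows where re-ranking sits in the pipeline; `cross_encoder.score` is a hypothetical stand-in for a cross-encoder model or a managed service such as Cohere Rerank.

```python
def rerank(query: str, candidates: list[dict], cross_encoder, top_k: int = 5) -> list[dict]:
    """Score each retrieved candidate against the query and keep the best top_k."""
    scored = [(cross_encoder.score(query, c["text"]), c) for c in candidates]  # hypothetical scorer
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_k]]
```

The design trade-off is explicit here: a wider initial retrieval (say 50 candidates) followed by re-ranking to 5 usually beats retrieving 5 directly, but adds one model call to the latency budget.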
Multi-Vector and Hierarchical Retrieval
Storing multiple embeddings per document (different chunk granularities, different aspects) enables more sophisticated retrieval. Hierarchical patterns retrieve broad documents first, then specific chunks within them.
Query Transformation
Transforming the user query before retrieval — by expansion, decomposition, or rewriting — can improve retrieval quality. A vague query “tell me about returns” might be expanded to “company return policy, process for returning items, return shipping options.”
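A minimal query-expansion sketch follows; `llm.complete` is a hypothetical stand-in for the platform's LLM client, and the prompt wording is illustrative.

```python
EXPANSION_PROMPT = (
    "Rewrite the user query as three short search queries that cover the likely "
    "intents. Return one query per line.\n\nUser query: {query}"
)


def expand_query(query: str, llm) -> list[str]:
    """Turn a vague query into several retrieval-friendly variants."""
    response = llm.complete(EXPANSION_PROMPT.format(query=query))  # hypothetical LLM client
    return [line.strip() for line in response.splitlines() if line.strip()]
```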
Self-Querying and Filtering
The LLM generates structured query metadata (filters, date ranges, document types) from the natural language query, enabling precise filtering before semantic retrieval.
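The sketch below illustrates the self-querying step under the assumption of a small, fixed filter schema; both the schema and `llm.complete` are illustrative stand-ins.

```python
import json

FILTER_PROMPT = (
    "Extract retrieval filters from the user query as JSON with keys "
    "'document_type', 'date_from', and 'date_to' (use null when absent).\n\n"
    "Query: {query}"
)


def extract_filters(query: str, llm) -> dict:
    """Derive structured metadata filters to apply before semantic retrieval."""
    return json.loads(llm.complete(FILTER_PROMPT.format(query=query)))  # hypothetical LLM client
```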
Knowledge Graph Augmentation
Combining vector retrieval with knowledge graph traversal supports use cases where relationships between entities matter (compliance dependencies, organisational reporting structures, product relationships). Microsoft’s GraphRAG project at https://github.com/microsoft/graphrag illustrates the pattern.
Agentic RAG
The retrieval step itself becomes a tool the LLM agent can invoke multiple times during a single response, asking different questions of the knowledge base as the response develops.
Operational Considerations
Embedding Refresh
When source documents change, their embeddings must be regenerated. Embedding refresh pipelines, change detection, and incremental updating are operational disciplines that distinguish production-grade RAG from prototype RAG.
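One common change-detection approach is a content hash per document, as sketched below; the mapping shapes are illustrative assumptions.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's cleaned text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def documents_to_reembed(current: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of documents whose content changed since the last embedding run.
    `current` maps doc_id -> cleaned text; `stored_hashes` maps doc_id -> previous hash."""
    return [
        doc_id for doc_id, text in current.items()
        if stored_hashes.get(doc_id) != content_hash(text)
    ]
```

Only the returned documents need re-chunking and re-embedding, which keeps refresh cost proportional to change volume rather than corpus size.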
Embedding Model Migration
Switching embedding models requires re-embedding the entire corpus (per the vendor lock-in discussion in Module 1.24). The migration must be planned, with parallel operation during cutover.
Chunking Strategy Tuning
Initial chunking strategies often need adjustment based on observed retrieval quality. Continuous improvement through quality measurement is essential.
Cost Management
RAG operations incur costs across embedding generation (priced per token), vector storage, and LLM generation (priced per token). Per-decision cost tracking (per Module 1.24) reveals expensive use cases that need optimisation.
Latency Engineering
End-to-end latency from user query to response includes retrieval, re-ranking, generation, and any post-processing. Latency budgets should be allocated per stage with monitoring.
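A small instrumentation sketch is shown below; the stage names in the usage comment are illustrative, and the point is simply that each stage's contribution to the budget gets measured.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Record wall-clock time for one pipeline stage so latency budgets can be checked."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Usage sketch (stage budgets and function names are illustrative):
# timings: dict[str, float] = {}
# with timed("retrieval", timings): chunks = retrieve(query)
# with timed("generation", timings): answer = generate(query, chunks)
```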
Caching
Identical or near-identical queries can hit caches at multiple layers (semantic cache, embedding cache, generation cache). Caching can materially reduce cost and latency for high-traffic patterns.
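A semantic cache can be sketched as a similarity check against previously answered queries; the 0.95 threshold below is an illustrative assumption that needs tuning, and `embed_fn` is a hypothetical embedding callable.

```python
import math


def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Return a previous answer when a new query is close enough to a cached one."""

    def __init__(self, embed_fn, threshold: float = 0.95) -> None:
        self._embed = embed_fn          # hypothetical embedding callable
        self._entries: list[tuple[list[float], str]] = []
        self._threshold = threshold

    def get(self, query: str) -> str | None:
        query_vec = self._embed(query)
        for vec, answer in self._entries:
            if _cosine(query_vec, vec) >= self._threshold:
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self._entries.append((self._embed(query), answer))
```

A cache hit skips retrieval and generation entirely, which is why even modest hit rates can change the cost profile of high-traffic query patterns.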
Governance Considerations
Source Attribution
Generated responses should cite the specific source chunks that grounded them. Attribution enables verification by the user and supports audit trails (per Module 1.21).
Source Authority
The retrieval corpus should be a curated source of authoritative information, not a general document dump. Including outdated, contradictory, or unauthoritative sources contaminates outputs.
Document Access Control
RAG must respect document-level access controls. Users should not see content from documents they are not authorised to access. The pattern requires propagating user context through the retrieval pipeline and filtering retrieval results accordingly.
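The simplest form of this filtering is a post-retrieval check, as sketched below under the assumption that each chunk carries an `allowed_groups` metadata field (an illustrative schema).

```python
def filter_by_access(retrieved: list[dict], user_groups: set[str]) -> list[dict]:
    """Drop retrieved chunks the requesting user is not authorised to see."""
    return [
        chunk for chunk in retrieved
        if user_groups & set(chunk.get("allowed_groups", []))
    ]
```

Where the vector database supports it, pushing the access filter into the retrieval query itself is usually preferable, so the top-k results are not depleted by chunks that are later removed.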
Sensitive Data Handling
RAG systems often have access to sensitive data through the retrieval corpus. Generating responses that incorporate sensitive data may create new exposure paths (the LLM might surface details in unexpected ways). Sensitive-data redaction in retrieval results is sometimes appropriate.
Hallucination Despite Grounding
Even with retrieval grounding, LLMs can hallucinate — producing claims not supported by retrieved context, particularly when retrieval is weak. Faithfulness evaluation (such as the Ragas framework) measures the proportion of generated claims supported by retrieved context.
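The faithfulness metric itself is a proportion, sketched below; `extract_claims` and `is_supported` are hypothetical stand-ins for the claim-splitting and support-checking steps a framework like Ragas performs, often with an LLM judge.

```python
def faithfulness(answer: str, context: str, extract_claims, is_supported) -> float:
    """Proportion of generated claims that are supported by the retrieved context."""
    claims = extract_claims(answer)            # hypothetical claim extractor
    if not claims:
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, context))  # hypothetical check
    return supported / len(claims)
```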
Evaluation and Monitoring
RAG systems should be evaluated on multiple dimensions: retrieval recall, retrieval precision, generation faithfulness, response quality, and end-to-end task success. The Stanford HELM framework at https://crfm.stanford.edu/helm/ and EleutherAI's LM Evaluation Harness provide reference patterns for systematic evaluation; general benchmarks such as TruthfulQA probe factual accuracy, while RAG-specific frameworks such as Ragas measure the distinctive failure modes (retrieval relevance, faithfulness).
Common Failure Modes
The first is retrieval failure invisible to the user — the system retrieves irrelevant content but generates a confident response anyway. Counter with retrieval quality monitoring and confidence indicators in responses.
The second is cascading hallucination — the LLM elaborates beyond what retrieval supports, with each generated sentence less supported than the last. Counter with rigorous prompt engineering and faithfulness evaluation.
The third is stale corpus — the knowledge base is not refreshed and the system confidently cites outdated information. Counter with refresh cadence appropriate to information volatility and freshness indicators in responses.
The fourth is cost surprise — embedding generation, storage, and LLM inference costs accumulate faster than budgeted. Counter with proactive cost monitoring and per-use-case cost ceilings.
The fifth is evaluation by anecdote — quality assessed by trying a few queries rather than systematic evaluation. Counter with structured evaluation sets and continuous monitoring.
Looking Forward
The next article in Module 2.21 turns to multi-modal AI systems — Generative AI that handles images, audio, and video alongside text — which shares many architectural patterns with RAG and adds modality-specific considerations.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.