Marco Patzelt
March 5, 2026

RAG vs Context Engineering: When Retrieval Adds Overhead

RAG was built for 4K-token context windows. With 200K+ tokens available now, context engineering gives agents full data visibility without the retrieval stack.

RAG — retrieval augmented generation — was the right answer when context windows were 4,096 tokens. You couldn't fit your data into the prompt, so you built a retrieval system to find the right chunks. Chunk your documents, embed them, store them in a vector database, retrieve the closest matches at query time, inject them into the prompt. Smart engineering for a real constraint.

But that constraint is disappearing. Claude handles 200K tokens. Gemini processes over 1M. The ceiling that justified RAG's complexity is gone — and the complexity isn't.

This isn't another "is RAG dead" article. There are plenty of those. This is a decision framework: when RAG earns its complexity, when it doesn't, and what context engineering looks like as the architecture that makes retrieval optional.

What Is RAG and How Does Retrieval Augmented Generation Work?

Retrieval augmented generation follows a straightforward pipeline. Take your documents — PDFs, wikis, database records — and split them into chunks. Each chunk gets converted into a numerical vector using an embedding model. Those vectors go into a vector database like Pinecone, Weaviate, or pgvector.

When a user asks a question, the system converts that question into a vector too. Then it runs a semantic search — finding the chunks whose vectors are closest to the question vector. The top results get injected into the LLM's prompt as context.
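The pipeline above can be sketched end to end. This is a deliberately minimal stand-in: a bag-of-words counter plays the role of the embedding model, and an in-memory list plays the vector database — the structure is the point, not the components.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system calls an embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: chunk, embed, store. The list stands in for Pinecone/pgvector/etc.
chunks = [
    "Invoices are due within 30 days of issue.",
    "Refunds above 500 euros require manager approval.",
    "Support is available 9 to 17 CET on weekdays.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question: str, k: int = 1) -> list[str]:
    # Query time: embed the question, rank stored chunks by similarity.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# The top results get injected into the LLM prompt as context.
context = "\n".join(retrieve("When are invoices due?"))
```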

The promise: your AI answers questions using your actual data instead of hallucinating. Ground the LLM in reality. Reduce hallucinations. Keep knowledge current without retraining.

When context windows were 4K-8K tokens, this was brilliant. You couldn't fit more than a few pages into the prompt. RAG was the only way to give an LLM access to large datasets. Credit where it's due — RAG solved a real problem.

But the landscape has changed.

The RAG Stack: What You're Actually Building

A production RAG system requires more infrastructure than most teams anticipate.

Document ingestion. Loaders for every format — PDF, HTML, Markdown, database exports. PDF parsing alone is a rabbit hole of edge cases.

Chunking strategy. Split documents into pieces small enough for embedding but large enough to preserve meaning. Overlapping windows? Semantic boundaries? Get this wrong and you destroy context at chunk boundaries: if a fact in paragraph 3 depends on paragraph 1 and the two land in different chunks, that connection vanishes.
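A minimal sliding-window chunker shows the trade-off. Sizes here are in characters for simplicity; production chunkers usually count tokens and respect sentence or section boundaries.

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size sliding window with overlap. Overlap preserves some
    context across boundaries -- at the cost of storing duplicate text."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```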

Embedding models. Convert text to numerical vectors. OpenAI's embeddings? A domain-specific model? Each has different dimensions, accuracy characteristics, and cost. Switch models later and you re-embed everything.

Vector database. Pinecone, Weaviate, Chroma, Qdrant, pgvector — an entire infrastructure layer to deploy, scale, monitor, and pay for, each with its own query language, indexing strategy, and failure modes.

Retrieval and re-ranking. Pure semantic search? Hybrid with keyword search (BM25)? Re-ranking models for better precision? Each layer adds latency and complexity to the RAG pipeline.

Prompt assembly. Take retrieved chunks, format them, inject into the prompt. Handle edge cases: too many results, conflicting information, irrelevant chunks that passed the similarity threshold.

Every layer is a potential failure point. Research shows roughly 50% of naive RAG failures come from retrieval limitations — the system simply can't find the right documents. That's not a generation problem. That's a search problem.

The hidden costs compound: vector database hosting ($50-500/month), 200-500ms added latency per query just for retrieval, embedding model updates requiring full re-indexing, and constant monitoring for quality degradation.

Semantic search has a fundamental limitation that most RAG tutorials skip over: similarity is not relevance. The vector closest to your question isn't necessarily the most useful for answering it. Cosine similarity works for straightforward queries. It breaks when the answer requires connecting information scattered across multiple documents.
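A toy illustration of that limitation (bag-of-words cosine here, but the effect shows up with learned embeddings too): a chunk that merely restates the question outscores the chunk that actually answers it.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

question = "what year was the tower completed"
restates = "the year the tower was completed is discussed below"  # similar, useless
answers  = "construction finished in 1889"                        # relevant, dissimilar

sim_restates = cosine(embed(question), embed(restates))
sim_answers  = cosine(embed(question), embed(answers))
# Ranking by similarity puts the echo of the question above the answer.
```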

From Naive RAG to Agentic RAG: What's Actually Evolving

Is RAG dead? No. But naive RAG — the single-pass retrieve-then-generate pipeline — is increasingly insufficient for real workloads.

What's evolving is agentic RAG. Instead of a fixed pipeline, an AI agent controls the retrieval loop: retrieve → reason about what's missing → retrieve again with a refined query → verify the answer → iterate. The agent dynamically adjusts retrieval strategy based on query complexity.

Agentic RAG handles multi-hop questions that require connecting dots across documents. It recovers from bad initial retrievals by reformulating queries. Gartner forecasts 33% of enterprise software will include agentic AI by 2028, and agentic RAG is a major driver of that trend.
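The loop above can be sketched in a few lines. `retrieve`, `assess`, and `generate` are stand-ins for real components (a retriever, an LLM judging what's missing, an LLM producing the answer) — the control flow is what distinguishes agentic from naive RAG.

```python
def agentic_answer(question, retrieve, assess, generate, max_rounds=3):
    """Retrieve -> reason about gaps -> retrieve again with a refined
    query -> generate. A sketch of the agentic loop, not a framework."""
    query, context = question, []
    for _ in range(max_rounds):
        context += retrieve(query)
        missing = assess(question, context)       # agent reasons about gaps
        if not missing:
            break
        query = f"{question} focusing on {missing}"  # refined follow-up query
    return generate(question, context)
```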

But agentic RAG still carries the full RAG infrastructure stack. Vector databases. Embedding pipelines. Chunking strategies. Re-ranking models. The retrieval is smarter, but the architecture hasn't gotten simpler. You've added an agent layer on top of the existing complexity.

The "Lost in the Middle" Problem

Stanford's "Lost in the Middle" research undermines the RAG thesis directly. LLMs lose 10-20+ percentage points in accuracy when relevant information sits in the middle of long contexts. Models attend better to information at the beginning and end — primacy and recency bias.

RAG chunks injected into a prompt typically land in the middle of the context. The retrieval worked perfectly — the right chunk was found — but the model underweighted it because of position. Perfect retrieval, imperfect attention.
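One common mitigation, sketched here as a simple reordering: place the best-ranked chunks at the edges of the prompt, where attention is strongest, and let the weakest sink to the middle.

```python
def edge_order(ranked_chunks: list[str]) -> list[str]:
    """Counter the 'lost in the middle' effect: alternate chunks (best-first
    input) toward the start and end of the prompt, weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```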

The irony that nobody talks about enough: RAG was designed to prevent hallucinations. But bad retrieval creates new hallucination vectors. Miss the right chunk and the agent reasons on incomplete data. Inject a partially relevant chunk and the model confidently generates answers grounded in wrong context. That hallucination didn't come from training data — it came from the retrieval pipeline.

RAG vs fine-tuning is a different axis entirely. Fine-tuning changes what the model knows permanently. RAG changes what the model sees per query. Both are valid. But neither answers the architecture question: do you actually need the retrieval infrastructure?

CAG and Direct Context Injection: The Simpler Architecture

With 200K+ token context windows, there's an approach most teams overlook: put the data directly in the prompt.


Cache-augmented generation (CAG) formalizes this. Pre-load your relevant knowledge into a cached prompt prefix. No retrieval step. No vector database. No chunking. No embedding model. The LLM sees everything — full context, full visibility, zero retrieval latency.

The benchmarks support this. On HotPotQA, CAG reduced generation time from 94.35 seconds with RAG to 2.33 seconds — a 40x speed improvement while matching accuracy. The paper "Don't Do RAG" (Chan et al., 2024) demonstrated that for knowledge tasks where data fits in the context window, CAG matches or exceeds RAG performance with dramatically less complexity.

Prompt caching makes this economical. Anthropic and OpenAI both offer prompt caching that reduces token costs 4-10x for cached prefixes. Pre-load your context once, cache it, and every subsequent query reuses those tokens at a fraction of the cost. CAG plus prompt caching: fast AND cheap, without a single vector database.
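As a sketch of what this looks like with Anthropic's Messages API — field names follow their prompt-caching documentation at the time of writing, and the model id is only an example, so verify against current docs:

```python
KNOWLEDGE_BASE = (
    "Full product documentation, pricing rules, support policies: "
    "everything the assistant should know, pre-loaded once."
)

def build_request(question: str) -> dict:
    # Request body shape for a cached prefix: the large, stable system
    # block carries cache_control; only the user question varies per call.
    return {
        "model": "claude-sonnet-4-20250514",  # example model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": KNOWLEDGE_BASE,
                "cache_control": {"type": "ephemeral"},  # prefix gets cached
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }
```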

Direct context injection takes the same principle into agent architectures: at runtime, inject exactly the data the agent needs. Database schema. Business rules. Live API data. User-specific constraints. The agent sees the complete picture — not fragments selected by cosine similarity.
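Runtime context assembly can be as plain as string building — the section labels and helper name below are illustrative, not a standard:

```python
def build_context(schema: str, rules: list[str], live_data: dict) -> str:
    """Assemble the agent's full runtime context. No retrieval step:
    the agent sees the complete picture every time."""
    rules_block = "\n".join(f"- {rule}" for rule in rules)
    data_block = "\n".join(f"{key}: {value}" for key, value in live_data.items())
    return (
        f"Database schema:\n{schema}\n\n"
        f"Business rules:\n{rules_block}\n\n"
        f"Live data:\n{data_block}"
    )
```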

The cost math most teams get wrong: a RAG query costs ~$0.00008 in compute. Cheap per query. But add the vector database ($50-500/month), embedding pipeline maintenance, and engineering time. Direct injection at 200K tokens costs $0.03-0.10 per query with zero infrastructure overhead. For bounded datasets and agent systems, the total cost of ownership often favors injection.
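That trade-off fits in one formula. The $200/month infrastructure figure and $0.05/query injection cost below are illustrative mid-range assumptions drawn from the ranges above:

```python
def monthly_cost_rag(queries: int, infra: float = 200.0,
                     per_query: float = 0.00008) -> float:
    # Cheap per query, but the infrastructure bill is there at zero queries.
    return infra + queries * per_query

def monthly_cost_injection(queries: int, per_query: float = 0.05) -> float:
    # No infrastructure; every query pays for its tokens.
    return queries * per_query

# Breakeven: infra / (0.05 - 0.00008), roughly 4,000 queries/month under
# these assumptions -- below that, injection wins outright, before counting
# engineering time or prompt-caching discounts.
```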

Less infrastructure, less abstraction, better results. If you can skip the vector database, the embedding pipeline, the chunking strategy, and the re-ranking model — and get equal or better results — that's not a shortcut. That's architecture.

What Is Context Engineering and Why Does It Make RAG Optional?

There's a term gaining traction that captures what I've been building: context engineering.

Context engineering optimizes everything the model knows and perceives at generation time. Not just retrieval — that's one small piece. Context engineering includes schema injection, tool access via MCP, business rules, personality constraints, live data feeds, and verification requirements. The complete runtime environment that shapes model output.

RAG becomes one tool in the context engineering toolbox. Not the default. Not obsolete. Just one mechanism for getting knowledge into context, alongside direct injection, cached context, and MCP-mediated data access.

The full context engineering stack for agents has three layers:

MCP (Model Context Protocol) — the plumbing. Gives agents structured access to APIs, databases, business logic, and external tools. MCP vs RAG isn't a competition. They operate at different layers. MCP handles tool access. RAG or CAG handles knowledge access.

Context injection — the memory. Runtime assembly of schemas, rules, constraints, and relevant data. This is where RAG, CAG, or direct injection fits — as one mechanism within the broader context assembly.

Agent loop — the decision layer. Plan → execute → verify → iterate. The agent uses assembled context and tools to accomplish multi-step tasks with verification at every step.
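The decision layer composes roughly like this; `plan`, `execute`, and `verify` are stand-ins for real components (an LLM planner, MCP tool calls, structured-output checks):

```python
def agent_loop(plan, execute, verify, goal, max_retries=2):
    """Plan -> execute -> verify -> iterate, with verification feedback
    folded back into the step on failure. A sketch, not a framework."""
    results = []
    for step in plan(goal):
        for _attempt in range(max_retries + 1):
            output = execute(step)
            ok, feedback = verify(step, output)
            if ok:
                results.append(output)
                break
            step = f"{step} (fix: {feedback})"  # feed verification back in
        else:
            raise RuntimeError(f"step failed after retries: {step}")
    return results
```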

What the industry now calls "context engineering" maps directly to what I've been calling environment design. The model doesn't need better training. It needs a better desk. Give it the schema, the business rules, the live data, the tools — at runtime. The environment becomes the prompt.

For agents, this is critical because they make sequential decisions. Each depends on correct context. A RAG retrieval miss at step 3 of a 10-step workflow cascades through every subsequent step. Direct injection eliminates retrieval risk by providing full visibility from the start.

The Decision Framework: When to Use What

No ideology. No tribalism. Architecture decisions based on actual constraints.

Use RAG when:

  • Your dataset exceeds any context window — millions of documents, terabytes of data
  • Knowledge updates constantly and requires near-real-time freshness
  • Multi-tenant access needs different data slices per user
  • Compliance demands citation tracking and retrieval audit trails
  • Unstructured data search at enterprise scale

Use CAG or direct injection when:

  • Your dataset fits in the context window (under 200K tokens for Claude, under 1M for Gemini)
  • Knowledge is static or semi-static — documentation, schemas, business rules
  • Agent systems where latency compounds across multi-step workflows
  • Bounded domain with predictable data needs
  • Simplicity and maintainability matter more than theoretical scale

Use agentic RAG when:

  • Complex multi-hop queries span large document collections
  • Query complexity varies widely, demanding dynamic retrieval strategies
  • Enterprise scale with diverse query types and heterogeneous data sources

The pattern I see constantly: teams default to RAG because it's the expected architecture, not because their constraints require it. They build vector databases for datasets that fit in a single 200K context window. They maintain embedding pipelines for knowledge bases that update monthly. They add months of infrastructure work to solve a problem that direct injection handles in a single API call.

If your data fits in context, skip RAG. If it doesn't, use RAG. Everything else is optimization, not architecture.
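That rule of thumb reduces to a back-of-the-envelope check. The ~4 characters per token figure is a rough heuristic for English text, and the reserve headroom is an assumption — tune both for your stack:

```python
def fits_in_context(documents: list[str], window_tokens: int = 200_000,
                    reserve: int = 20_000) -> bool:
    """Rough check: ~4 chars/token for English text, with headroom
    reserved for instructions and the model's answer."""
    est_tokens = sum(len(doc) for doc in documents) // 4
    return est_tokens <= window_tokens - reserve

# fits -> inject directly; doesn't fit -> you actually need retrieval
```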

Complexity Is Not Architecture

RAG solved a real problem. For many use cases, that problem — limited context windows — no longer exists.

The simplest architecture that works is the best architecture. For agents operating in bounded domains, that means context engineering: runtime data injection, tool access via MCP, verification through structured outputs. No vector database. No embedding pipeline. No chunking strategy. Full visibility. Full control.

What the industry is discovering as context engineering — environment design, runtime injection, verification loops — is the architecture that actually matters. RAG is one possible input. Not the architecture itself.

RAG was the answer to a 4K-token world. I don't live there anymore. Build the environment, not the retrieval pipeline.


Frequently Asked Questions

What is RAG and how does it work?

RAG chunks documents, embeds them in a vector database, and retrieves relevant pieces at query time to inject into an LLM prompt. It grounds AI responses in your actual data instead of relying on training data alone.

Is RAG still necessary with large context windows?

Not always. If your dataset fits in the context window, direct injection or cache-augmented generation (CAG) achieves equal or better results without retrieval infrastructure. RAG remains necessary for datasets too large for any context window.

What is cache-augmented generation (CAG)?

CAG pre-loads relevant knowledge into a cached prompt prefix, eliminating retrieval entirely. On HotPotQA benchmarks, CAG cut response time from 94s (RAG) to 2.3s while matching accuracy — with zero vector database infrastructure.

How does context engineering relate to RAG?

Context engineering optimizes everything the model sees at generation time — schemas, tools, rules, live data. RAG is one context source among several. Context engineering orchestrates all sources into a unified runtime environment.

What is agentic RAG?

Agentic RAG uses an AI agent to control retrieval loops — dynamically refining queries and re-retrieving based on reasoning. Traditional RAG is a single-pass pipeline. Agentic RAG is smarter but carries the full infrastructure stack.

When should I use RAG versus direct injection?

Use RAG for datasets too large for context windows, constantly updating knowledge, or compliance audit trails. Use direct injection for bounded domains, agent systems, and static knowledge that fits in 200K tokens.

What does RAG cost compared to long context?

RAG averages $0.00008 per query but requires infrastructure ($50-500/month). Long context costs $0.03-0.10 per query with zero infrastructure. Prompt caching from Anthropic and OpenAI reduces cached prefix costs by 4-10x.

Is RAG dead?

Naive single-pass RAG is increasingly obsolete. Agentic RAG is evolving. But for most bounded agent use cases, context engineering with direct injection is simpler and equally effective. RAG isn't dead — it's becoming optional.
