Machine Learning & AI

RAG Explained: Traditional vs Vectorless Retrieval-Augmented Generation

54 min read
Pawan Kumar
#RAG #Retrieval-Augmented Generation #AI #LLM #Vector Search #Machine Learning #ChatGPT
RAG Explained: Traditional vs Vectorless Retrieval-Augmented Generation

RAG Explained: Traditional vs Vectorless Retrieval-Augmented Generation

You built a chatbot using GPT-4. It’s impressive—until a customer asks about your latest product launch from last week. The bot confidently makes up features that don’t exist. Your support team is now spending hours correcting AI hallucinations.

This is the problem that nearly killed enterprise AI adoption. LLMs are brilliant, but they only know what they were trained on. Ask about anything after their training cutoff date, or anything specific to your business, and they’ll either admit ignorance or worse—hallucinate convincingly wrong answers.

RAG (Retrieval-Augmented Generation) solved this. Now ChatGPT can browse the web. Perplexity AI cites sources. Your enterprise chatbot can answer questions using your company’s internal docs. The AI doesn’t need to memorize everything—it just needs to know where to look.

In this guide, I’ll show you how RAG works, why traditional vector-based RAG isn’t always the answer, and how vectorless RAG is opening new possibilities. Real examples, real trade-offs, no fluff.


What Is RAG and Why Does It Matter?

RAG stands for Retrieval-Augmented Generation. Break that down:

Retrieval: Find relevant information from external sources (documents, databases, APIs) Augmented: Add that information to the AI’s context Generation: Let the AI generate a response using both its training and the retrieved info

Think of it like an open-book exam versus a closed-book exam. Without RAG, your AI is taking a closed-book exam—it can only use what it memorized during training. With RAG, it gets to look things up, cite sources, and give accurate answers based on current information.

The Problem RAG Solves

LLMs have three fundamental limitations:

Knowledge Cutoff: GPT-4’s training data ends in April 2023. Ask it about events after that, and it’s clueless. Your business changes daily—product updates, policy changes, new documentation. The AI needs access to current information.

Hallucinations: When LLMs don’t know something, they often make stuff up. And they do it confidently. This is catastrophic for customer support, medical advice, legal information, or anything where accuracy matters.

Domain-Specific Knowledge: GPT-4 knows general information, but it doesn’t know your company’s internal processes, your codebase, your customer data. You need a way to give it access to your specific knowledge.

RAG fixes all three problems. The AI retrieves current, accurate, domain-specific information and uses it to generate responses. No hallucinations (or at least, far fewer). No outdated information. No generic answers.

Real-World Impact

OpenAI’s ChatGPT: Added browsing capability in 2023. Now it can search the web, read articles, and cite sources. This transformed it from a knowledge snapshot into a research assistant.

Perplexity AI: Built entirely around RAG. Every answer includes citations to sources. It’s like having a research assistant that reads dozens of articles and summarizes them for you. Over 10 million monthly users.

Microsoft Copilot: Uses RAG to access your emails, documents, and calendar. It can answer “What did Sarah say about the Q4 budget?” by actually reading your emails, not guessing.

Notion AI: Searches your workspace to answer questions. “What were the action items from last week’s standup?” It finds the meeting notes and extracts the answer.

GitHub Copilot: Uses RAG to search your codebase and relevant documentation. It suggests code that matches your project’s patterns and conventions, not just generic examples.

The pattern is clear: RAG is how you make LLMs useful for real-world applications.

The Problem RAG Solves Comparison showing LLM limitations without RAG versus capabilities with RAG The Problem RAG Solves ❌ Without RAG LLM Alone Problems: 📅 Knowledge Cutoff "What happened in March 2026?" "I don't have information after April 2023" 🎭 Hallucinations "What's our refund policy?" *Makes up a plausible-sounding policy* 🏢 No Domain Knowledge "How do I use our internal API?" "I don't have access to your docs" Result: Unreliable for production use ✓ With RAG LLM + Retrieval Solutions: 📅 Current Information "What happened in March 2026?" *Searches web, finds articles, summarizes* 🎯 Accurate Answers "What's our refund policy?" *Retrieves policy doc, quotes exactly* 🏢 Domain Expertise "How do I use our internal API?" *Searches docs, provides code examples* Result: Production-ready AI applications RAG transforms LLMs from knowledge snapshots into dynamic research assistants

How Traditional RAG Works

Let’s break down the classic RAG pipeline that powers most AI applications today.

The Basic Flow

When a user asks a question, here’s what happens:

  1. Convert the question to a vector using an embedding model
  2. Search your knowledge base for documents with similar vectors
  3. Retrieve the top K most relevant documents (usually 3-10)
  4. Stuff those documents into the LLM’s context along with the question
  5. Generate an answer using both the retrieved docs and the LLM’s knowledge

The magic is in step 2—semantic search using vector embeddings. This finds documents that are conceptually similar to the question, even if they don’t share exact keywords.

A Concrete Example

Let’s say you’re building a customer support chatbot for an e-commerce company. A customer asks: “How long does shipping take to Canada?”

Without RAG: The LLM might say “Typically 5-7 business days” based on general knowledge. But your company actually offers 2-day shipping to Canada. Wrong answer, unhappy customer.

With RAG:

  1. Question gets converted to a vector
  2. System searches your knowledge base (shipping policies, FAQ docs, etc.)
  3. Finds the document: “Canada Shipping Policy - 2-day express available”
  4. Passes both the question and the retrieved document to the LLM
  5. LLM generates: “We offer 2-day express shipping to Canada. You can select this option at checkout.”

Accurate answer. Happy customer. That’s the power of RAG.

Traditional RAG Pipeline Architecture Complete flow diagram showing how traditional RAG systems work from query to response with vector embeddings Traditional RAG Pipeline: Step-by-Step 1. User Query "How long does shipping take?" 2. Embed Query [0.23, -0.45, 0.89, ...] 3. Vector Search Find similar documents Knowledge Base (Vector Database) 📄 Shipping Policy ✓ Similarity: 0.92 Retrieved 📄 Canada FAQ ✓ Similarity: 0.88 Retrieved 📄 Return Policy Similarity: 0.45 Not retrieved 4. Augment LLM Context Question + Retrieved Documents → LLM "Based on these docs: [Shipping Policy]... answer: How long..." 5. Generate Accurate Response "We offer 2-day express shipping to Canada. Select this option at checkout."

Building a Traditional RAG System

Let’s get practical. Here’s what you need to build a production RAG system.

Step 1: Prepare Your Knowledge Base

You need documents to retrieve from. This could be:

  • Product documentation
  • Customer support articles
  • Internal wikis
  • API documentation
  • Past conversations
  • Database records

The key is chunking—breaking documents into smaller pieces. Why? Because you can’t stuff an entire 50-page manual into the LLM’s context. You need to find the relevant sections.

Chunking strategies:

Fixed-size chunks: Split every 500 tokens. Simple but can break mid-sentence or mid-concept.

Semantic chunks: Split at natural boundaries (paragraphs, sections, topics). Better quality but requires more processing.

Sliding window: Overlapping chunks so context isn’t lost at boundaries. A sentence that ends chunk 1 also starts chunk 2.

Most production systems use semantic chunking with some overlap. Aim for 200-500 tokens per chunk—small enough to be specific, large enough to have context.

Step 2: Generate Embeddings

Convert each chunk to a vector using an embedding model. This is the same process we covered in the vector embeddings post—you’re creating numerical representations that capture meaning.

Popular embedding models:

  • OpenAI text-embedding-3-small (1,536 dimensions, $0.02 per 1M tokens)
  • Sentence Transformers (free, open source, 384 dimensions)
  • Cohere embeddings (multilingual, 1,024 dimensions)
  • Google’s Vertex AI embeddings (768 dimensions)

For most applications, Sentence Transformers is a solid starting point. It’s free, fast, and good enough. You can always upgrade later.

Step 3: Store in a Vector Database

You need a database optimized for similarity search. Regular databases can’t efficiently find “documents similar to this vector.”

Vector database options:

  • Pinecone: Managed, easy, scales automatically ($70/month for 1M vectors)
  • Weaviate: Open source, feature-rich, self-hosted
  • Qdrant: Rust-based, very fast, open source with managed option
  • Chroma: Simple, embedded, great for prototypes
  • pgvector: PostgreSQL extension, good if you’re already using Postgres

For prototyping, use Chroma—it’s dead simple. For production, Pinecone if you want managed, Qdrant if you want to self-host.

Step 4: Build the Retrieval Logic

When a query comes in:

  1. Embed the query using the same model you used for documents
  2. Search the vector database for top K similar chunks (K = 3-10 typically)
  3. Optionally re-rank results using additional signals (recency, popularity, user permissions)
  4. Return the most relevant chunks

The retrieval should take under 50ms. Any slower and your chatbot feels laggy.

Step 5: Augment and Generate

Now comes the LLM part. You construct a prompt that includes:

  • System instructions (“You are a helpful customer support agent”)
  • Retrieved documents (“Here are relevant docs: [doc1], [doc2], [doc3]”)
  • User question (“How long does shipping take to Canada?”)
  • Instructions (“Answer based on the provided documents. Cite sources.”)

The LLM reads everything and generates a response. Because it has the actual shipping policy in context, it gives an accurate answer.

RAG System Components and Architecture Detailed architecture diagram showing all components of a production RAG system Production RAG System Architecture Offline: Document Indexing (One-time) Documents PDFs, Docs, Web pages Chunking Split into 500-token pieces Embedding Convert to vectors Vector DB Store with HNSW index Run once, then update incrementally as docs change Online: Query Processing (Real-time) User Query "Shipping to Canada?" ~15ms Embed Query to vector ~30ms Search Find top 5 similar docs ~20ms Re-rank Score by relevance ~10ms LLM Generate response Retrieved Context (Top 3 Chunks) 📄 Chunk 1: "We offer express shipping to Canada with 2-day delivery..." 📄 Chunk 2: "International shipping rates: Canada $15, UK $20..." 📄 Chunk 3: "Customs clearance for Canadian orders typically takes 1-2 days..." These chunks are injected into the LLM prompt as context Total Latency: ~75ms retrieval + 1-3s LLM generation

Real-World RAG Implementations

Let’s look at how companies actually use RAG in production.

ChatGPT with Browsing

When you enable browsing in ChatGPT, here’s what happens behind the scenes:

  1. You ask a question that requires current information
  2. ChatGPT decides it needs to search (using a classifier or heuristic)
  3. It generates search queries and uses Bing API to search the web
  4. Retrieves top search results and fetches their content
  5. Reads the web pages (with rate limiting and politeness)
  6. Summarizes findings and generates a response with citations

The clever part? ChatGPT decides when to search. Not every question needs retrieval. “What’s 2+2?” doesn’t need a web search. “What’s the weather in Tokyo?” does.

This multi-step reasoning (should I search? what should I search for? how do I synthesize results?) is what makes it feel intelligent.

Perplexity AI: RAG as a Product

Perplexity built their entire product around RAG. Every answer includes citations. Here’s their approach:

  1. User asks a question
  2. Perplexity generates multiple search queries (query expansion)
  3. Searches the web using multiple search engines
  4. Retrieves and ranks results
  5. Reads the top 10-20 sources
  6. Generates a comprehensive answer with inline citations
  7. Shows sources at the bottom for verification

The key innovation? They don’t just retrieve once. They do iterative retrieval—if the first set of results doesn’t answer the question, they search again with refined queries. This multi-hop retrieval dramatically improves answer quality.

Notion AI: Private Knowledge RAG

Notion’s AI searches your workspace—notes, docs, databases. The challenge? Privacy and permissions.

Their RAG system:

  1. Only searches documents you have access to (permission-aware retrieval)
  2. Chunks documents while preserving structure (headings, lists, tables)
  3. Uses hybrid search (vector similarity + keyword matching)
  4. Caches frequently accessed chunks for speed
  5. Updates index in real-time as you edit documents

The result? You can ask “What did we decide about the pricing model?” and it finds the relevant meeting notes, even if they’re from 6 months ago and buried in a nested page.

Stripe Documentation Assistant

Stripe uses RAG to help developers find answers in their extensive API documentation. The interesting part? They combine multiple retrieval strategies:

Vector search: Finds semantically similar docs Keyword search: Matches exact API names and error codes Code search: Finds similar code examples Popularity ranking: Prioritizes frequently accessed docs

This hybrid approach handles different query types. “How do I create a payment intent?” benefits from semantic search. “What’s error code 402?” needs exact keyword matching.


The Limitations of Traditional RAG

Vector-based RAG is powerful, but it’s not perfect. Here are the real problems you’ll face.

Problem 1: The Chunking Dilemma

You need to chunk documents, but how? Too small and you lose context. Too large and you can’t fit enough chunks in the LLM’s context window.

Say you have a 10-page document about shipping policies. You chunk it into 20 pieces. A user asks about Canadian shipping. The relevant information is split across 3 chunks—one mentions the 2-day delivery, another mentions the cost, a third mentions customs.

Do you retrieve all 3? That uses up precious context space. Retrieve just 1? You give an incomplete answer.

There’s no perfect solution. You tune chunk size and overlap based on your specific documents and query patterns.

Problem 2: Semantic Search Isn’t Always Right

Vector search finds semantically similar content. But sometimes you need exact matches.

A user asks: “What’s the error code for invalid API key?” The answer is “401”. But vector search might return documents about “authentication errors” or “API security” that mention 401 in passing. The exact, direct answer gets buried.

This is why hybrid search (vectors + keywords) performs better than pure vector search. You need both semantic understanding and exact matching.

Problem 3: The Context Window Limit

LLMs have limited context windows. GPT-4 Turbo has 128K tokens, but that’s still finite. If you retrieve 10 documents of 1,000 tokens each, you’ve used 10K tokens just for context. That leaves less room for conversation history and the actual response.

You’re constantly making trade-offs: retrieve more documents for better coverage, or retrieve fewer to leave room for longer conversations?

Problem 4: Retrieval Latency

Every retrieval adds latency. Embedding the query takes 10-20ms. Vector search takes 20-50ms. Fetching document content takes another 10-30ms. That’s 40-100ms before you even call the LLM.

For a chatbot, that’s noticeable. Users expect instant responses. Every millisecond of latency hurts the experience.

Problem 5: The Cold Start Problem

Your RAG system is only as good as your knowledge base. If you don’t have documents covering a topic, retrieval returns nothing useful, and the LLM falls back to its training data (which might be outdated or wrong).

Building a comprehensive knowledge base takes time. You need to identify gaps, create missing documentation, and continuously update as things change.

Traditional RAG Challenges Diagram showing five major challenges in traditional vector-based RAG systems Traditional RAG: Five Key Challenges 1. Chunking Dilemma Information split across multiple chunks Too small = lost context Too large = can't fit enough in LLM context 2. Semantic Search Limits Misses exact matches (error codes, IDs) Returns conceptually similar but wrong docs Needs hybrid search (vectors + keywords) 3. Context Window Limits Can only fit 5-10 retrieved documents Trade-off: more context vs conversation history GPT-4: 128K tokens, but retrieval uses 10-20K 4. Retrieval Latency Embedding: 10-20ms Vector search: 20-50ms Total: 40-100ms before LLM even starts 5. Knowledge Base Quality RAG is only as good as your documents Missing docs = LLM falls back to (possibly wrong) training data Requires continuous maintenance and gap analysis These challenges led to the development of vectorless RAG approaches Sometimes simpler retrieval methods work better than complex vector search

Enter Vectorless RAG

Here’s a controversial take: you don’t always need vector embeddings for RAG. Sometimes simpler approaches work better, cost less, and are easier to maintain.

Vectorless RAG uses traditional retrieval methods—keyword search, SQL queries, API calls—instead of vector similarity search. And for many use cases, it’s actually superior.

What Is Vectorless RAG?

Instead of converting everything to vectors and doing similarity search, you use:

Keyword search: Good old Elasticsearch or PostgreSQL full-text search SQL queries: Direct database lookups based on structured data API calls: Fetch data from external services in real-time Graph traversal: Follow relationships in knowledge graphs Hybrid approaches: Combine multiple retrieval methods

The key insight: not all retrieval needs semantic understanding. Sometimes you just need to find the right record in a database or call the right API.

When Vectorless RAG Wins

Structured data queries: User asks “What’s my order status for order #12345?” You don’t need semantic search—you need a SQL query: SELECT status FROM orders WHERE id = 12345. Done in 5ms, no embeddings needed.

Exact matching: “What’s the error code for timeout?” You want documents containing “timeout” and “error code”, not semantically similar documents about “delays” or “failures”. Keyword search is faster and more accurate.

Real-time data: “What’s the current price of Bitcoin?” You don’t search documents—you call an API. The data changes every second; no point in indexing it.

Hierarchical navigation: “Show me all products in the Electronics > Laptops > Gaming category.” This is a tree traversal, not a similarity search. SQL or a graph database handles this better than vectors.

Multi-step reasoning: “Find customers who bought product A but not product B in the last 30 days.” This requires complex SQL joins and filters. Vectors can’t express this kind of logic.

A Concrete Vectorless RAG Example

Let’s build a customer support bot for an e-commerce site using vectorless RAG.

User asks: “Where’s my order?”

The system:

  1. Extracts the user ID from the session
  2. Runs SQL query: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 5
  3. Gets the user’s recent orders
  4. Formats the data as context for the LLM
  5. LLM generates: “Your most recent order (#12345) shipped yesterday and will arrive March 31. Tracking: [link]”

No embeddings. No vector database. Just a SQL query and an LLM. Total latency? 10ms for the query + 1-2s for LLM generation. That’s faster than vector-based RAG.

User asks: “Can I return this?”

The system:

  1. Identifies the product from context (order #12345, product ID 789)
  2. Runs SQL: SELECT return_policy FROM products WHERE id = 789
  3. Also queries: SELECT days_since_delivery FROM orders WHERE id = 12345
  4. Retrieves: “30-day return policy” and “delivered 5 days ago”
  5. LLM generates: “Yes, you can return this item. You have 25 days left in your 30-day return window. Here’s how: [instructions]”

Again, no vectors. Just structured data queries. The LLM gets exactly the information it needs, nothing more, nothing less.

Vectorless RAG Architecture Architecture diagram showing how vectorless RAG uses SQL, APIs, and keyword search instead of vector embeddings Vectorless RAG: Multiple Retrieval Strategies User Query "Where's my order #12345?" Query Router Decides which retrieval method to use SQL Query Structured data SELECT * FROM orders WHERE... ⚡ 5-10ms Keyword Search Exact term matching Elasticsearch Full-text search ⚡ 10-30ms API Calls Real-time data Weather, prices, stock levels ⚡ 50-200ms Graph Traversal Relationships Knowledge graphs Connected data ⚡ 10-50ms LLM Generation Synthesizes retrieved data into natural language response Key Advantage: Choose the right retrieval method for each query type

Traditional RAG vs Vectorless RAG: The Showdown

Let’s compare these approaches across different scenarios.

Scenario 1: Customer Support

Question: “What’s my account balance?”

Traditional RAG:

  • Embed the question
  • Search vector database for similar documents
  • Might retrieve: FAQ about checking balances, documentation about account types
  • LLM generates generic answer
  • Latency: 50ms retrieval + 2s generation
  • Accuracy: Medium (no actual balance data)

Vectorless RAG:

  • Extract user ID from session
  • SQL query: SELECT balance FROM accounts WHERE user_id = ?
  • Get actual balance: $1,234.56
  • LLM generates: “Your current balance is $1,234.56”
  • Latency: 5ms query + 1s generation
  • Accuracy: Perfect (real data)

Winner: Vectorless RAG. Faster, more accurate, simpler.

Question: “How do I authenticate API requests?”

Traditional RAG:

  • Embed the question
  • Search documentation vectors
  • Retrieve relevant sections about authentication
  • LLM synthesizes answer from multiple docs
  • Latency: 40ms retrieval + 2s generation
  • Accuracy: High (finds conceptually related docs)

Vectorless RAG:

  • Keyword search for “authenticate” AND “API”
  • Might miss docs that use “authorization” instead
  • Retrieves fewer relevant results
  • Latency: 20ms search + 2s generation
  • Accuracy: Medium (keyword matching limitations)

Winner: Traditional RAG. Semantic understanding matters for documentation.

Scenario 3: Real-Time Data

Question: “What’s the weather in Tokyo right now?”

Traditional RAG:

  • Embed the question
  • Search for weather-related documents
  • Retrieves old weather reports or general info about Tokyo weather
  • LLM generates outdated answer
  • Latency: 40ms retrieval + 2s generation
  • Accuracy: Low (stale data)

Vectorless RAG:

  • Detect this is a weather query
  • Call weather API with location=”Tokyo”
  • Get current weather: 18°C, partly cloudy
  • LLM generates: “It’s currently 18°C and partly cloudy in Tokyo”
  • Latency: 100ms API call + 1s generation
  • Accuracy: Perfect (real-time data)

Winner: Vectorless RAG. Real-time data needs API calls, not document search.

Scenario 4: Complex Research

Question: “Compare the security features of AWS, Azure, and Google Cloud for healthcare applications.”

Traditional RAG:

  • Embed the question
  • Search for documents about cloud security and healthcare
  • Retrieves relevant sections from multiple sources
  • LLM synthesizes comprehensive comparison
  • Latency: 60ms retrieval + 5s generation
  • Accuracy: High (finds nuanced information across sources)

Vectorless RAG:

  • Keyword search for “AWS security healthcare”
  • Misses documents that discuss concepts without exact keywords
  • Retrieves fewer relevant results
  • LLM has less context to work with
  • Latency: 30ms search + 4s generation
  • Accuracy: Medium (misses semantic connections)

Winner: Traditional RAG. Complex research benefits from semantic understanding.

Traditional vs Vectorless RAG Comparison Side-by-side comparison showing when to use traditional RAG versus vectorless RAG approaches Traditional RAG vs Vectorless RAG: When to Use What Traditional RAG (Vector Embeddings) ✓ Best For: • Semantic search across documents • Finding conceptually similar content • Research and analysis tasks • Unstructured text (articles, docs) • Questions with varied phrasing Example Queries: "How do I improve API performance?" "What are best practices for security?" "Compare different caching strategies" Pros: ✓ Understands meaning, not just keywords ✓ Handles varied question phrasing ✓ Great for exploratory queries Cons: ✗ Higher latency (40-100ms retrieval) ✗ More expensive (embedding costs) ✗ Complex infrastructure ✗ Misses exact matches sometimes Vectorless RAG (SQL, APIs, Keywords) ✓ Best For: • Structured data queries • Exact matching (IDs, codes, names) • Real-time data (prices, inventory) • Database lookups • API integrations Example Queries: "What's my order status?" "Show me products under $50" "What's the current Bitcoin price?" Pros: ✓ Much faster (5-50ms retrieval) ✓ Lower cost (no embedding fees) ✓ Simpler infrastructure ✓ Perfect for structured data Cons: ✗ No semantic understanding ✗ Keyword matching limitations ✗ Struggles with varied phrasing

Hybrid RAG: The Best of Both Worlds

Here’s the truth: most production systems don’t choose one approach. They use both.

The Hybrid Approach

Build a query router that decides which retrieval method to use:

Structured data queries → SQL Real-time data → API calls Exact matching → Keyword search Semantic search → Vector embeddings Complex research → Multiple methods combined

The router can be rule-based (pattern matching) or ML-based (classifier that predicts query type). Most systems start with rules and add ML later.

Example: E-commerce Support Bot

Query: “Where’s my order?”

  • Router detects: order status query
  • Method: SQL lookup
  • Retrieval: 5ms

Query: “Do you have waterproof hiking boots?”

  • Router detects: product search
  • Method: Vector search + filters
  • Retrieval: 40ms

Query: “What’s your return policy?”

  • Router detects: policy question
  • Method: Keyword search in FAQ docs
  • Retrieval: 15ms

Query: “Compare your shipping options”

  • Router detects: comparison query
  • Method: Retrieve all shipping docs (keyword) + synthesize
  • Retrieval: 25ms

Each query type gets the optimal retrieval method. This is how you build production-quality RAG systems.

Real-World Hybrid Systems

Intercom: Their customer support AI uses SQL for user data, vector search for help articles, and API calls for real-time metrics. The router decides based on query intent.

Zendesk AI: Combines ticket history (SQL), knowledge base (vectors), and external integrations (APIs). They report 30% faster resolution times compared to pure vector RAG.

Salesforce Einstein: Uses graph traversal for relationship queries (“Show me all contacts at companies in the tech industry”), vector search for finding similar cases, and SQL for structured data. The hybrid approach handles the complexity of CRM data.


Advanced RAG Techniques

Once you have the basics working, here are techniques that significantly improve quality.

Query Expansion

Don’t just search with the user’s exact question. Generate multiple variations:

Original: “How do I reset my password?” Expansions:

  • “password reset process”
  • “forgot password recovery”
  • “change account password”
  • “reset login credentials”

Search with all variations and combine results. This catches documents that use different terminology.

LLMs are great at query expansion. Ask GPT-4 to generate 5 variations of a query, then search with all of them. Retrieval quality improves by 20-30%.

Hypothetical Document Embeddings (HyDE)

Here’s a clever trick: instead of embedding the query, have the LLM generate a hypothetical answer, then embed that answer and search for similar documents.

Why does this work? Because the hypothetical answer uses the same vocabulary and structure as actual documents. It’s more similar to what you’re looking for than the question itself.

Example:

  • Query: “How do I optimize database queries?”
  • Hypothetical answer: “To optimize database queries, use indexes on frequently queried columns, avoid SELECT *, use EXPLAIN to analyze query plans…”
  • Embed the hypothetical answer and search

This finds documents that actually explain query optimization, not just documents that mention “database” and “optimize.”

Re-Ranking

Don’t just use the top K results from vector search. Re-rank them using additional signals:

Recency: Newer documents might be more relevant Popularity: Frequently accessed docs are often higher quality User feedback: Docs with positive ratings rank higher Source authority: Official docs rank higher than community posts Cross-encoder scoring: Use a specialized model to score query-document pairs

The initial vector search is fast but approximate. Re-ranking with a more sophisticated model improves precision.

Cohere’s Rerank API is purpose-built for this. It takes your query and candidate documents, scores each pair, and returns them sorted by relevance. It’s slower than vector search alone but much more accurate.

Multi-Hop Retrieval

Sometimes one retrieval isn’t enough. You need to retrieve, read, then retrieve again based on what you learned.

Example:

  • Query: “What’s the recommended tire pressure for a 2023 Tesla Model 3?”
  • First retrieval: Find the Model 3 manual
  • Read: “Tire pressure specifications are in the vehicle placard”
  • Second retrieval: Search for “vehicle placard location Model 3”
  • Find: “The placard is on the driver’s door jamb”
  • Third retrieval: Get the actual pressure specs
  • Generate answer with all context

This iterative retrieval mimics how humans research—you find one piece of info, which leads you to the next, until you have everything you need.

Contextual Compression

You retrieved 10 documents, but they’re full of irrelevant information. Instead of passing all 10,000 tokens to the LLM, compress them first.

Use a smaller, faster LLM to extract only the relevant sentences from each document. Then pass the compressed context to the main LLM.

Before compression: 10 documents × 1,000 tokens = 10,000 tokens After compression: 10 documents × 200 tokens = 2,000 tokens

You’ve saved 8,000 tokens of context space. That’s room for more retrieved documents or longer conversation history.

LangChain’s ContextualCompressionRetriever does this automatically. It’s a game-changer for long documents.

Advanced RAG Techniques Diagram showing four advanced RAG techniques that improve retrieval quality and efficiency Advanced RAG Techniques: Beyond Basic Retrieval 1. Query Expansion Original: "How do I reset my password?" ↓ LLM generates variations ↓ "password reset process" "forgot password recovery" "change account password" Impact: 20-30% better retrieval 2. HyDE (Hypothetical Docs) Query: "Optimize database queries" ↓ Generate hypothetical answer ↓ "Use indexes, avoid SELECT *, analyze with EXPLAIN..." ↓ Embed this, search for similar docs ↓ Impact: Finds docs with similar structure 3. Re-Ranking Initial vector search returns 50 candidates Re-rank by: • Cross-encoder similarity score • Recency (newer = better) • User feedback ratings • Source authority Return top 5 after re-ranking Impact: 15-25% accuracy improvement 4. Contextual Compression Retrieved: 10 docs × 1,000 tokens = 10,000 tokens ↓ Extract only relevant sentences ↓ Compressed: 10 docs × 200 tokens = 2,000 tokens Impact: 80% token reduction Fit 5x more context in same space

Building Your First RAG System

Let’s get practical. Here’s how to build a simple RAG system in an afternoon.

The Minimal Setup

What you need:

  • Python with LangChain library
  • OpenAI API key (or use local LLM)
  • Chroma vector database (embedded, no setup needed)
  • Your documents (PDFs, text files, whatever)

The implementation:

Load your documents, split them into chunks, generate embeddings, store in Chroma. When a query comes in, retrieve relevant chunks, pass them to the LLM with the question, get an answer.

Total code? About 50 lines. Total time? 2-3 hours including testing.

The Production Setup

For production, you need more:

Infrastructure:

  • Managed vector database (Pinecone or Qdrant)
  • Caching layer (Redis for query results)
  • Monitoring (track latency, costs, quality)
  • Rate limiting (prevent abuse)

Quality improvements:

  • Hybrid search (vectors + keywords)
  • Query expansion
  • Re-ranking
  • Contextual compression
  • User feedback loop

Operational concerns:

  • Document update pipeline (how do you keep embeddings fresh?)
  • Permission handling (who can access what?)
  • Cost optimization (caching, batching)
  • Failure handling (what if vector DB is down?)

This takes weeks to build properly. But start simple and iterate.

Common Pitfalls and How to Avoid Them

I’ve seen teams waste months on RAG implementations that don’t work. Here are the mistakes to avoid.

Pitfall 1: Over-engineering from day one

You don’t need a sophisticated hybrid system with re-ranking and compression on day one. Start with basic vector search. Get it working. Then optimize based on actual problems you encounter.

I’ve seen teams spend 3 months building the perfect RAG system before testing it with real users. When they finally launched, they discovered their chunking strategy was completely wrong for their documents. Start simple, iterate fast.

Pitfall 2: Ignoring retrieval quality

Your RAG system is only as good as your retrieval. If you’re retrieving irrelevant documents, the LLM will generate garbage answers.

Monitor retrieval metrics: precision (are retrieved docs relevant?), recall (are you finding all relevant docs?), and latency. Set up logging to see what’s being retrieved for each query. You’ll quickly spot patterns and problems.

Pitfall 3: Chunk size guessing

Don’t just pick 500 tokens because that’s what the tutorial used. Test different chunk sizes with your actual documents and queries. I’ve seen optimal chunk sizes range from 200 to 2,000 tokens depending on document structure.

Run experiments: try 200, 500, 1,000, and 2,000 token chunks. Measure retrieval quality for each. Pick the winner.

Pitfall 4: Forgetting about cost

Embeddings cost money. If you’re embedding millions of documents, that adds up. OpenAI charges $0.02 per 1M tokens for embeddings. Sounds cheap until you’re processing 100M tokens.

Calculate costs before you build. Consider using smaller embedding models (384 dimensions instead of 1,536) or open-source alternatives. The quality difference is often negligible.

Pitfall 5: No fallback strategy

What happens when retrieval returns nothing useful? Your LLM falls back to its training data and might hallucinate.

Build a confidence threshold. If retrieval scores are below 0.7, tell the user “I don’t have enough information to answer that” instead of making something up. Honesty beats hallucination.


Evaluating RAG Performance

You can’t improve what you don’t measure. Here’s how to evaluate your RAG system.

Retrieval Metrics

Precision: Of the documents you retrieved, how many were actually relevant? Recall: Of all relevant documents, how many did you retrieve? MRR (Mean Reciprocal Rank): How high up is the first relevant document? NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality

For most applications, focus on precision. Better to retrieve 3 highly relevant docs than 10 docs where only 3 are relevant.

Generation Metrics

Faithfulness: Does the answer stick to the retrieved documents, or does it hallucinate? Answer relevance: Does the answer actually address the question? Context relevance: Were the retrieved documents relevant to the question?

You can measure these automatically using LLM-as-a-judge. Have GPT-4 evaluate each answer on a 1-5 scale for faithfulness and relevance. It’s not perfect, but it’s better than nothing.

End-to-End Metrics

Latency: Total time from query to response (target: under 3 seconds) Cost per query: Embedding + retrieval + LLM generation costs User satisfaction: Thumbs up/down, explicit feedback Task completion rate: Did the user get what they needed?

The metric that matters most? User satisfaction. If users are happy, your RAG system is working.

Building a Test Set

Create a golden dataset of 100-200 question-answer pairs. For each:

  • The question
  • The expected answer
  • The documents that should be retrieved
  • The evaluation criteria

Run your RAG system against this test set regularly. Track metrics over time. This catches regressions when you make changes.

Pro tip: Start with 20 examples. Add more as you encounter edge cases in production. Your test set should evolve with your system.


The Decision Framework: Which RAG Approach to Use?

Here’s how to decide between traditional RAG, vectorless RAG, or hybrid.

Use Traditional RAG (Vector-Based) When:

✓ You have unstructured text (documentation, articles, support tickets) ✓ Questions can be phrased many different ways ✓ You need semantic understanding, not just keyword matching ✓ You’re doing research or analysis across many documents ✓ Your knowledge base is large (10,000+ documents)

Examples: Documentation search, research assistants, content discovery, semantic Q&A

Use Vectorless RAG When:

✓ You have structured data (databases, APIs) ✓ Questions require exact matching (IDs, codes, names) ✓ You need real-time data (prices, inventory, weather) ✓ Latency is critical (need sub-50ms retrieval) ✓ You want to minimize infrastructure complexity

Examples: Customer support (order status, account info), real-time data queries, database Q&A, transactional systems

Use Hybrid RAG When:

✓ You have both structured and unstructured data ✓ Different query types need different retrieval methods ✓ You need maximum accuracy and flexibility ✓ You have the engineering resources to build and maintain it

Examples: Enterprise chatbots, complex support systems, multi-source knowledge bases, production AI applications

The Practical Reality

Most successful RAG systems start simple and evolve:

Month 1: Basic vector RAG with Chroma and OpenAI embeddings Month 3: Add keyword search for exact matching Month 6: Implement query routing and hybrid retrieval Month 12: Add re-ranking, compression, and advanced techniques

Don’t try to build the perfect system on day one. Build something that works, measure it, improve it.

RAG Decision Framework Decision tree showing how to choose between traditional RAG, vectorless RAG, or hybrid approach RAG Decision Framework: Choose Your Approach What type of data do you have? Structured (DB, APIs) Vectorless RAG Use SQL, APIs, keyword search ✓ Fast (5-50ms) ✓ Simple infrastructure ✓ Lower cost Unstructured (Docs, text) Traditional RAG Use vector embeddings ✓ Semantic understanding ✓ Handles varied phrasing ✓ Great for research Both types Hybrid RAG Route queries intelligently ✓ Best of both worlds ✓ Maximum flexibility ✗ More complex to build Additional Considerations Budget & Resources: • Limited budget? Start with vectorless or open-source embeddings • Small team? Keep it simple - complexity kills velocity Latency Requirements: • Need sub-100ms retrieval? Vectorless RAG or aggressive caching • Can tolerate 200-300ms? Traditional RAG works fine Scale: • Small knowledge base (<1,000 docs)? Chroma or even in-memory search • Large scale (1M+ docs)? Managed vector DB like Pinecone

RAG in 2026: What’s Next?

The RAG landscape is evolving fast. Here’s what’s happening now.

Multimodal RAG

RAG isn’t just for text anymore. Companies are building systems that retrieve images, videos, audio, and code.

Google’s Gemini can search across text, images, and videos. Ask “Show me examples of modern kitchen designs” and it retrieves relevant images, analyzes them, and generates design suggestions.

GitHub Copilot uses RAG to search your codebase and relevant repositories. It retrieves code snippets, not just documentation, and suggests implementations that match your project’s patterns.

Agentic RAG

Instead of a single retrieve-and-generate step, AI agents decide what to retrieve, when to retrieve, and how to combine information from multiple sources.

Anthropic’s Claude with tool use can decide to search the web, query a database, call an API, or use its training data—all in a single conversation. It’s RAG with reasoning about retrieval strategy.

Fine-Tuned Retrieval Models

Generic embedding models are good, but domain-specific models are better. Companies are fine-tuning embedding models on their own data.

Cohere offers fine-tuning for their embedding models. Train on your documents and queries, and retrieval quality improves by 30-40%. The cost? A few hundred dollars and a day of compute time.

RAG + Long Context Windows

GPT-4 Turbo has 128K tokens. Claude 3 has 200K. Gemini 1.5 has 1M tokens. With these massive context windows, do we still need RAG?

Yes, but differently. Instead of retrieving 5 documents, you can retrieve 50. Instead of compressing context, you can include full documents. RAG becomes less about fitting information into limited space and more about finding the right information in the first place.


Key Takeaways

Let’s wrap this up with what actually matters.

RAG solves the fundamental problem of LLM knowledge limitations. It gives AI access to current, accurate, domain-specific information. This transforms LLMs from knowledge snapshots into dynamic research assistants.

Traditional RAG (vector-based) excels at semantic search. Use it for documentation, research, and any scenario where understanding meaning matters more than exact matching. The trade-off is higher latency and cost.

Vectorless RAG excels at structured data and exact matching. Use it for database queries, real-time data, and scenarios where speed and simplicity matter. The trade-off is no semantic understanding.

Hybrid RAG gives you the best of both worlds. Build a query router that picks the right retrieval method for each query type. This is how production systems work, but it requires more engineering effort.

Start simple, iterate based on real usage. Don’t over-engineer on day one. Build basic RAG, test with real users, measure what matters, then optimize. Most teams waste time building sophisticated systems for problems they don’t have yet.

Measure everything. Track retrieval quality, generation accuracy, latency, cost, and user satisfaction. You can’t improve what you don’t measure. Build a test set and run it regularly.

The future of AI applications isn’t just better LLMs—it’s better retrieval. RAG is how you make AI useful for real-world problems. Master it, and you’ll build AI that people actually want to use.


What’s Your RAG Challenge?

I’ve built RAG systems for documentation search, customer support, and internal knowledge bases. Each one taught me something new about what works and what doesn’t.

What are you building? Struggling with retrieval quality? Dealing with latency issues? Trying to decide between vector and vectorless approaches?

Let’s talk. Drop me a message—I’d love to hear about your RAG challenges and share what I’ve learned.

Connect with me:

  • Email: [your-email]
  • LinkedIn: [your-linkedin]
  • Twitter: [your-twitter]

Building AI that actually works is hard. But it’s also incredibly rewarding when you get it right. Let’s figure it out together.

Share this article

Help others discover this content

Comments & Discussion

Join the conversation! Share your thoughts, ask questions, or provide feedback below.

Continue Reading

Related Articles

Explore more content you might find interesting