RAG Explained: Traditional vs Vectorless Retrieval-Augmented Generation

You built a chatbot using GPT-4. It’s impressive—until a customer asks about your latest product launch from last week. The bot confidently makes up features that don’t exist. Your support team is now spending hours correcting AI hallucinations.

This is the problem that nearly killed enterprise AI adoption. LLMs are brilliant, but they only know what they were trained on. Ask about anything after their training cutoff date, or anything specific to your business, and they’ll either admit ignorance or worse—hallucinate convincingly wrong answers.

RAG (Retrieval-Augmented Generation) solved this. Now ChatGPT can browse the web. Perplexity AI cites sources. Your enterprise chatbot can answer questions using your company’s internal docs. The AI doesn’t need to memorize everything—it just needs to know where to look.

In this guide, I’ll show you how RAG works, why traditional vector-based RAG isn’t always the answer, and how vectorless RAG is opening new possibilities. Real examples, real trade-offs, no fluff.

What Is RAG and Why Does It Matter?

RAG stands for Retrieval-Augmented Generation. Break that down:

Retrieval: Find relevant information from external sources (documents, databases, APIs) Augmented: Add that information to the AI’s context Generation: Let the AI generate a response using both its training and the retrieved info

Think of it like an open-book exam versus a closed-book exam. Without RAG, your AI is taking a closed-book exam—it can only use what it memorized during training. With RAG, it gets to look things up, cite sources, and give accurate answers based on current information.

The Problem RAG Solves

LLMs have three fundamental limitations:

Knowledge Cutoff: GPT-4’s training data ends in April 2023. Ask it about events after that, and it’s clueless. Your business changes daily—product updates, policy changes, new documentation. The AI needs access to current information.

Hallucinations: When LLMs don’t know something, they often make stuff up. And they do it confidently. This is catastrophic for customer support, medical advice, legal information, or anything where accuracy matters.

Domain-Specific Knowledge: GPT-4 knows general information, but it doesn’t know your company’s internal processes, your codebase, your customer data. You need a way to give it access to your specific knowledge.

RAG fixes all three problems. The AI retrieves current, accurate, domain-specific information and uses it to generate responses. No hallucinations (or at least, far fewer). No outdated information. No generic answers.

Real-World Impact

OpenAI’s ChatGPT: Added browsing capability in 2023. Now it can search the web, read articles, and cite sources. This transformed it from a knowledge snapshot into a research assistant.

Perplexity AI: Built entirely around RAG. Every answer includes citations to sources. It’s like having a research assistant that reads dozens of articles and summarizes them for you. Over 10 million monthly users.

Microsoft Copilot: Uses RAG to access your emails, documents, and calendar. It can answer “What did Sarah say about the Q4 budget?” by actually reading your emails, not guessing.

Notion AI: Searches your workspace to answer questions. “What were the action items from last week’s standup?” It finds the meeting notes and extracts the answer.

GitHub Copilot: Uses RAG to search your codebase and relevant documentation. It suggests code that matches your project’s patterns and conventions, not just generic examples.

The pattern is clear: RAG is how you make LLMs useful for real-world applications.

How Traditional RAG Works

Let’s break down the classic RAG pipeline that powers most AI applications today.

The Basic Flow

When a user asks a question, here’s what happens:

Convert the question to a vector using an embedding model
Search your knowledge base for documents with similar vectors
Retrieve the top K most relevant documents (usually 3-10)
Stuff those documents into the LLM’s context along with the question
Generate an answer using both the retrieved docs and the LLM’s knowledge

The magic is in step 2—semantic search using vector embeddings. This finds documents that are conceptually similar to the question, even if they don’t share exact keywords.

A Concrete Example

Let’s say you’re building a customer support chatbot for an e-commerce company. A customer asks: “How long does shipping take to Canada?”

Without RAG: The LLM might say “Typically 5-7 business days” based on general knowledge. But your company actually offers 2-day shipping to Canada. Wrong answer, unhappy customer.

With RAG:

Question gets converted to a vector
System searches your knowledge base (shipping policies, FAQ docs, etc.)
Finds the document: “Canada Shipping Policy - 2-day express available”
Passes both the question and the retrieved document to the LLM
LLM generates: “We offer 2-day express shipping to Canada. You can select this option at checkout.”

Accurate answer. Happy customer. That’s the power of RAG.

Building a Traditional RAG System

Let’s get practical. Here’s what you need to build a production RAG system.

Step 1: Prepare Your Knowledge Base

You need documents to retrieve from. This could be:

Product documentation
Customer support articles
Internal wikis
API documentation
Past conversations
Database records

The key is chunking—breaking documents into smaller pieces. Why? Because you can’t stuff an entire 50-page manual into the LLM’s context. You need to find the relevant sections.

Chunking strategies:

Fixed-size chunks: Split every 500 tokens. Simple but can break mid-sentence or mid-concept.

Semantic chunks: Split at natural boundaries (paragraphs, sections, topics). Better quality but requires more processing.

Sliding window: Overlapping chunks so context isn’t lost at boundaries. A sentence that ends chunk 1 also starts chunk 2.

Most production systems use semantic chunking with some overlap. Aim for 200-500 tokens per chunk—small enough to be specific, large enough to have context.

Step 2: Generate Embeddings

Convert each chunk to a vector using an embedding model. This is the same process we covered in the vector embeddings post—you’re creating numerical representations that capture meaning.

Popular embedding models:

OpenAI text-embedding-3-small (1,536 dimensions, $0.02 per 1M tokens)
Sentence Transformers (free, open source, 384 dimensions)
Cohere embeddings (multilingual, 1,024 dimensions)
Google’s Vertex AI embeddings (768 dimensions)

For most applications, Sentence Transformers is a solid starting point. It’s free, fast, and good enough. You can always upgrade later.

Step 3: Store in a Vector Database

You need a database optimized for similarity search. Regular databases can’t efficiently find “documents similar to this vector.”

Vector database options:

Pinecone: Managed, easy, scales automatically ($70/month for 1M vectors)
Weaviate: Open source, feature-rich, self-hosted
Qdrant: Rust-based, very fast, open source with managed option
Chroma: Simple, embedded, great for prototypes
pgvector: PostgreSQL extension, good if you’re already using Postgres

For prototyping, use Chroma—it’s dead simple. For production, Pinecone if you want managed, Qdrant if you want to self-host.

Step 4: Build the Retrieval Logic

When a query comes in:

Embed the query using the same model you used for documents
Search the vector database for top K similar chunks (K = 3-10 typically)
Optionally re-rank results using additional signals (recency, popularity, user permissions)
Return the most relevant chunks

The retrieval should take under 50ms. Any slower and your chatbot feels laggy.

Step 5: Augment and Generate

Now comes the LLM part. You construct a prompt that includes:

System instructions (“You are a helpful customer support agent”)
Retrieved documents (“Here are relevant docs: [doc1], [doc2], [doc3]”)
User question (“How long does shipping take to Canada?”)
Instructions (“Answer based on the provided documents. Cite sources.”)

The LLM reads everything and generates a response. Because it has the actual shipping policy in context, it gives an accurate answer.

Real-World RAG Implementations

Let’s look at how companies actually use RAG in production.

ChatGPT with Browsing

When you enable browsing in ChatGPT, here’s what happens behind the scenes:

You ask a question that requires current information
ChatGPT decides it needs to search (using a classifier or heuristic)
It generates search queries and uses Bing API to search the web
Retrieves top search results and fetches their content
Reads the web pages (with rate limiting and politeness)
Summarizes findings and generates a response with citations

The clever part? ChatGPT decides when to search. Not every question needs retrieval. “What’s 2+2?” doesn’t need a web search. “What’s the weather in Tokyo?” does.

This multi-step reasoning (should I search? what should I search for? how do I synthesize results?) is what makes it feel intelligent.

Perplexity AI: RAG as a Product

Perplexity built their entire product around RAG. Every answer includes citations. Here’s their approach:

User asks a question
Perplexity generates multiple search queries (query expansion)
Searches the web using multiple search engines
Retrieves and ranks results
Reads the top 10-20 sources
Generates a comprehensive answer with inline citations
Shows sources at the bottom for verification

The key innovation? They don’t just retrieve once. They do iterative retrieval—if the first set of results doesn’t answer the question, they search again with refined queries. This multi-hop retrieval dramatically improves answer quality.

Notion AI: Private Knowledge RAG

Notion’s AI searches your workspace—notes, docs, databases. The challenge? Privacy and permissions.

Their RAG system:

Only searches documents you have access to (permission-aware retrieval)
Chunks documents while preserving structure (headings, lists, tables)
Uses hybrid search (vector similarity + keyword matching)
Caches frequently accessed chunks for speed
Updates index in real-time as you edit documents

The result? You can ask “What did we decide about the pricing model?” and it finds the relevant meeting notes, even if they’re from 6 months ago and buried in a nested page.

Stripe Documentation Assistant

Stripe uses RAG to help developers find answers in their extensive API documentation. The interesting part? They combine multiple retrieval strategies:

Vector search: Finds semantically similar docs Keyword search: Matches exact API names and error codes Code search: Finds similar code examples Popularity ranking: Prioritizes frequently accessed docs

This hybrid approach handles different query types. “How do I create a payment intent?” benefits from semantic search. “What’s error code 402?” needs exact keyword matching.

The Limitations of Traditional RAG

Vector-based RAG is powerful, but it’s not perfect. Here are the real problems you’ll face.

Problem 1: The Chunking Dilemma

You need to chunk documents, but how? Too small and you lose context. Too large and you can’t fit enough chunks in the LLM’s context window.

Say you have a 10-page document about shipping policies. You chunk it into 20 pieces. A user asks about Canadian shipping. The relevant information is split across 3 chunks—one mentions the 2-day delivery, another mentions the cost, a third mentions customs.

Do you retrieve all 3? That uses up precious context space. Retrieve just 1? You give an incomplete answer.

There’s no perfect solution. You tune chunk size and overlap based on your specific documents and query patterns.

Problem 2: Semantic Search Isn’t Always Right

Vector search finds semantically similar content. But sometimes you need exact matches.

A user asks: “What’s the error code for invalid API key?” The answer is “401”. But vector search might return documents about “authentication errors” or “API security” that mention 401 in passing. The exact, direct answer gets buried.

This is why hybrid search (vectors + keywords) performs better than pure vector search. You need both semantic understanding and exact matching.

Problem 3: The Context Window Limit

LLMs have limited context windows. GPT-4 Turbo has 128K tokens, but that’s still finite. If you retrieve 10 documents of 1,000 tokens each, you’ve used 10K tokens just for context. That leaves less room for conversation history and the actual response.

You’re constantly making trade-offs: retrieve more documents for better coverage, or retrieve fewer to leave room for longer conversations?

Problem 4: Retrieval Latency

Every retrieval adds latency. Embedding the query takes 10-20ms. Vector search takes 20-50ms. Fetching document content takes another 10-30ms. That’s 40-100ms before you even call the LLM.

For a chatbot, that’s noticeable. Users expect instant responses. Every millisecond of latency hurts the experience.

Problem 5: The Cold Start Problem

Your RAG system is only as good as your knowledge base. If you don’t have documents covering a topic, retrieval returns nothing useful, and the LLM falls back to its training data (which might be outdated or wrong).

Building a comprehensive knowledge base takes time. You need to identify gaps, create missing documentation, and continuously update as things change.

Enter Vectorless RAG

Here’s a controversial take: you don’t always need vector embeddings for RAG. Sometimes simpler approaches work better, cost less, and are easier to maintain.

Vectorless RAG uses traditional retrieval methods—keyword search, SQL queries, API calls—instead of vector similarity search. And for many use cases, it’s actually superior.

What Is Vectorless RAG?

Instead of converting everything to vectors and doing similarity search, you use:

Keyword search: Good old Elasticsearch or PostgreSQL full-text search SQL queries: Direct database lookups based on structured data API calls: Fetch data from external services in real-time Graph traversal: Follow relationships in knowledge graphs Hybrid approaches: Combine multiple retrieval methods

The key insight: not all retrieval needs semantic understanding. Sometimes you just need to find the right record in a database or call the right API.

When Vectorless RAG Wins

Structured data queries: User asks “What’s my order status for order #12345?” You don’t need semantic search—you need a SQL query: SELECT status FROM orders WHERE id = 12345. Done in 5ms, no embeddings needed.

Exact matching: “What’s the error code for timeout?” You want documents containing “timeout” and “error code”, not semantically similar documents about “delays” or “failures”. Keyword search is faster and more accurate.

Real-time data: “What’s the current price of Bitcoin?” You don’t search documents—you call an API. The data changes every second; no point in indexing it.

Hierarchical navigation: “Show me all products in the Electronics > Laptops > Gaming category.” This is a tree traversal, not a similarity search. SQL or a graph database handles this better than vectors.

Multi-step reasoning: “Find customers who bought product A but not product B in the last 30 days.” This requires complex SQL joins and filters. Vectors can’t express this kind of logic.

A Concrete Vectorless RAG Example

Let’s build a customer support bot for an e-commerce site using vectorless RAG.

User asks: “Where’s my order?”

The system:

Extracts the user ID from the session
Runs SQL query: SELECT * FROM orders WHERE user_id = ? ORDER BY created_at DESC LIMIT 5
Gets the user’s recent orders
Formats the data as context for the LLM
LLM generates: “Your most recent order (#12345) shipped yesterday and will arrive March 31. Tracking: [link]”

No embeddings. No vector database. Just a SQL query and an LLM. Total latency? 10ms for the query + 1-2s for LLM generation. That’s faster than vector-based RAG.

User asks: “Can I return this?”

The system:

Identifies the product from context (order #12345, product ID 789)
Runs SQL: SELECT return_policy FROM products WHERE id = 789
Also queries: SELECT days_since_delivery FROM orders WHERE id = 12345
Retrieves: “30-day return policy” and “delivered 5 days ago”
LLM generates: “Yes, you can return this item. You have 25 days left in your 30-day return window. Here’s how: [instructions]”

Again, no vectors. Just structured data queries. The LLM gets exactly the information it needs, nothing more, nothing less.

Traditional RAG vs Vectorless RAG: The Showdown

Let’s compare these approaches across different scenarios.

Scenario 1: Customer Support

Question: “What’s my account balance?”

Traditional RAG:

Embed the question
Search vector database for similar documents
Might retrieve: FAQ about checking balances, documentation about account types
LLM generates generic answer
Latency: 50ms retrieval + 2s generation
Accuracy: Medium (no actual balance data)

Vectorless RAG:

Extract user ID from session
SQL query: SELECT balance FROM accounts WHERE user_id = ?
Get actual balance: $1,234.56
LLM generates: “Your current balance is $1,234.56”
Latency: 5ms query + 1s generation
Accuracy: Perfect (real data)

Winner: Vectorless RAG. Faster, more accurate, simpler.

Scenario 2: Documentation Search

Question: “How do I authenticate API requests?”

Traditional RAG:

Embed the question
Search documentation vectors
Retrieve relevant sections about authentication
LLM synthesizes answer from multiple docs
Latency: 40ms retrieval + 2s generation
Accuracy: High (finds conceptually related docs)

Vectorless RAG:

Keyword search for “authenticate” AND “API”
Might miss docs that use “authorization” instead
Retrieves fewer relevant results
Latency: 20ms search + 2s generation
Accuracy: Medium (keyword matching limitations)

Winner: Traditional RAG. Semantic understanding matters for documentation.

Scenario 3: Real-Time Data

Question: “What’s the weather in Tokyo right now?”

Traditional RAG:

Embed the question
Search for weather-related documents
Retrieves old weather reports or general info about Tokyo weather
LLM generates outdated answer
Latency: 40ms retrieval + 2s generation
Accuracy: Low (stale data)

Vectorless RAG:

Detect this is a weather query
Call weather API with location=”Tokyo”
Get current weather: 18°C, partly cloudy
LLM generates: “It’s currently 18°C and partly cloudy in Tokyo”
Latency: 100ms API call + 1s generation
Accuracy: Perfect (real-time data)

Winner: Vectorless RAG. Real-time data needs API calls, not document search.

Scenario 4: Complex Research

Question: “Compare the security features of AWS, Azure, and Google Cloud for healthcare applications.”

Traditional RAG:

Embed the question
Search for documents about cloud security and healthcare
Retrieves relevant sections from multiple sources
LLM synthesizes comprehensive comparison
Latency: 60ms retrieval + 5s generation
Accuracy: High (finds nuanced information across sources)

Vectorless RAG:

Keyword search for “AWS security healthcare”
Misses documents that discuss concepts without exact keywords
Retrieves fewer relevant results
LLM has less context to work with
Latency: 30ms search + 4s generation
Accuracy: Medium (misses semantic connections)

Winner: Traditional RAG. Complex research benefits from semantic understanding.

Hybrid RAG: The Best of Both Worlds

Here’s the truth: most production systems don’t choose one approach. They use both.

The Hybrid Approach

Build a query router that decides which retrieval method to use:

Structured data queries → SQL Real-time data → API calls Exact matching → Keyword search Semantic search → Vector embeddings Complex research → Multiple methods combined

The router can be rule-based (pattern matching) or ML-based (classifier that predicts query type). Most systems start with rules and add ML later.

Example: E-commerce Support Bot

Query: “Where’s my order?”

Router detects: order status query
Method: SQL lookup
Retrieval: 5ms

Query: “Do you have waterproof hiking boots?”

Router detects: product search
Method: Vector search + filters
Retrieval: 40ms

Query: “What’s your return policy?”

Router detects: policy question
Method: Keyword search in FAQ docs
Retrieval: 15ms

Query: “Compare your shipping options”

Router detects: comparison query
Method: Retrieve all shipping docs (keyword) + synthesize
Retrieval: 25ms

Each query type gets the optimal retrieval method. This is how you build production-quality RAG systems.

Real-World Hybrid Systems

Intercom: Their customer support AI uses SQL for user data, vector search for help articles, and API calls for real-time metrics. The router decides based on query intent.

Zendesk AI: Combines ticket history (SQL), knowledge base (vectors), and external integrations (APIs). They report 30% faster resolution times compared to pure vector RAG.

Salesforce Einstein: Uses graph traversal for relationship queries (“Show me all contacts at companies in the tech industry”), vector search for finding similar cases, and SQL for structured data. The hybrid approach handles the complexity of CRM data.

Advanced RAG Techniques

Once you have the basics working, here are techniques that significantly improve quality.

Query Expansion

Don’t just search with the user’s exact question. Generate multiple variations:

Original: “How do I reset my password?” Expansions:

“password reset process”
“forgot password recovery”
“change account password”
“reset login credentials”

Search with all variations and combine results. This catches documents that use different terminology.

LLMs are great at query expansion. Ask GPT-4 to generate 5 variations of a query, then search with all of them. Retrieval quality improves by 20-30%.

Hypothetical Document Embeddings (HyDE)

Here’s a clever trick: instead of embedding the query, have the LLM generate a hypothetical answer, then embed that answer and search for similar documents.

Why does this work? Because the hypothetical answer uses the same vocabulary and structure as actual documents. It’s more similar to what you’re looking for than the question itself.

Example:

Query: “How do I optimize database queries?”
Hypothetical answer: “To optimize database queries, use indexes on frequently queried columns, avoid SELECT *, use EXPLAIN to analyze query plans…”
Embed the hypothetical answer and search

This finds documents that actually explain query optimization, not just documents that mention “database” and “optimize.”

Re-Ranking

Don’t just use the top K results from vector search. Re-rank them using additional signals:

Recency: Newer documents might be more relevant Popularity: Frequently accessed docs are often higher quality User feedback: Docs with positive ratings rank higher Source authority: Official docs rank higher than community posts Cross-encoder scoring: Use a specialized model to score query-document pairs

The initial vector search is fast but approximate. Re-ranking with a more sophisticated model improves precision.

Cohere’s Rerank API is purpose-built for this. It takes your query and candidate documents, scores each pair, and returns them sorted by relevance. It’s slower than vector search alone but much more accurate.

Multi-Hop Retrieval

Sometimes one retrieval isn’t enough. You need to retrieve, read, then retrieve again based on what you learned.

Example:

Query: “What’s the recommended tire pressure for a 2023 Tesla Model 3?”
First retrieval: Find the Model 3 manual
Read: “Tire pressure specifications are in the vehicle placard”
Second retrieval: Search for “vehicle placard location Model 3”
Find: “The placard is on the driver’s door jamb”
Third retrieval: Get the actual pressure specs
Generate answer with all context

This iterative retrieval mimics how humans research—you find one piece of info, which leads you to the next, until you have everything you need.

Contextual Compression

You retrieved 10 documents, but they’re full of irrelevant information. Instead of passing all 10,000 tokens to the LLM, compress them first.

Use a smaller, faster LLM to extract only the relevant sentences from each document. Then pass the compressed context to the main LLM.

Before compression: 10 documents × 1,000 tokens = 10,000 tokens After compression: 10 documents × 200 tokens = 2,000 tokens

You’ve saved 8,000 tokens of context space. That’s room for more retrieved documents or longer conversation history.

LangChain’s ContextualCompressionRetriever does this automatically. It’s a game-changer for long documents.

Building Your First RAG System

Let’s get practical. Here’s how to build a simple RAG system in an afternoon.

The Minimal Setup

What you need:

Python with LangChain library
OpenAI API key (or use local LLM)
Chroma vector database (embedded, no setup needed)
Your documents (PDFs, text files, whatever)

The implementation:

Load your documents, split them into chunks, generate embeddings, store in Chroma. When a query comes in, retrieve relevant chunks, pass them to the LLM with the question, get an answer.

Total code? About 50 lines. Total time? 2-3 hours including testing.

The Production Setup

For production, you need more:

Infrastructure:

Managed vector database (Pinecone or Qdrant)
Caching layer (Redis for query results)
Monitoring (track latency, costs, quality)
Rate limiting (prevent abuse)

Quality improvements:

Hybrid search (vectors + keywords)
Query expansion
Re-ranking
Contextual compression
User feedback loop

Operational concerns:

Document update pipeline (how do you keep embeddings fresh?)
Permission handling (who can access what?)
Cost optimization (caching, batching)
Failure handling (what if vector DB is down?)

This takes weeks to build properly. But start simple and iterate.

Common Pitfalls and How to Avoid Them

I’ve seen teams waste months on RAG implementations that don’t work. Here are the mistakes to avoid.

Pitfall 1: Over-engineering from day one

You don’t need a sophisticated hybrid system with re-ranking and compression on day one. Start with basic vector search. Get it working. Then optimize based on actual problems you encounter.

I’ve seen teams spend 3 months building the perfect RAG system before testing it with real users. When they finally launched, they discovered their chunking strategy was completely wrong for their documents. Start simple, iterate fast.

Pitfall 2: Ignoring retrieval quality

Your RAG system is only as good as your retrieval. If you’re retrieving irrelevant documents, the LLM will generate garbage answers.

Monitor retrieval metrics: precision (are retrieved docs relevant?), recall (are you finding all relevant docs?), and latency. Set up logging to see what’s being retrieved for each query. You’ll quickly spot patterns and problems.

Pitfall 3: Chunk size guessing

Don’t just pick 500 tokens because that’s what the tutorial used. Test different chunk sizes with your actual documents and queries. I’ve seen optimal chunk sizes range from 200 to 2,000 tokens depending on document structure.

Run experiments: try 200, 500, 1,000, and 2,000 token chunks. Measure retrieval quality for each. Pick the winner.

Pitfall 4: Forgetting about cost

Embeddings cost money. If you’re embedding millions of documents, that adds up. OpenAI charges $0.02 per 1M tokens for embeddings. Sounds cheap until you’re processing 100M tokens.

Calculate costs before you build. Consider using smaller embedding models (384 dimensions instead of 1,536) or open-source alternatives. The quality difference is often negligible.

Pitfall 5: No fallback strategy

What happens when retrieval returns nothing useful? Your LLM falls back to its training data and might hallucinate.

Build a confidence threshold. If retrieval scores are below 0.7, tell the user “I don’t have enough information to answer that” instead of making something up. Honesty beats hallucination.

Evaluating RAG Performance

You can’t improve what you don’t measure. Here’s how to evaluate your RAG system.

Retrieval Metrics

Precision: Of the documents you retrieved, how many were actually relevant? Recall: Of all relevant documents, how many did you retrieve? MRR (Mean Reciprocal Rank): How high up is the first relevant document? NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality

For most applications, focus on precision. Better to retrieve 3 highly relevant docs than 10 docs where only 3 are relevant.

Generation Metrics

Faithfulness: Does the answer stick to the retrieved documents, or does it hallucinate? Answer relevance: Does the answer actually address the question? Context relevance: Were the retrieved documents relevant to the question?

You can measure these automatically using LLM-as-a-judge. Have GPT-4 evaluate each answer on a 1-5 scale for faithfulness and relevance. It’s not perfect, but it’s better than nothing.

End-to-End Metrics

Latency: Total time from query to response (target: under 3 seconds) Cost per query: Embedding + retrieval + LLM generation costs User satisfaction: Thumbs up/down, explicit feedback Task completion rate: Did the user get what they needed?

The metric that matters most? User satisfaction. If users are happy, your RAG system is working.

Building a Test Set

Create a golden dataset of 100-200 question-answer pairs. For each:

The question
The expected answer
The documents that should be retrieved
The evaluation criteria

Run your RAG system against this test set regularly. Track metrics over time. This catches regressions when you make changes.

Pro tip: Start with 20 examples. Add more as you encounter edge cases in production. Your test set should evolve with your system.

The Decision Framework: Which RAG Approach to Use?

Here’s how to decide between traditional RAG, vectorless RAG, or hybrid.

Use Traditional RAG (Vector-Based) When:

✓ You have unstructured text (documentation, articles, support tickets) ✓ Questions can be phrased many different ways ✓ You need semantic understanding, not just keyword matching ✓ You’re doing research or analysis across many documents ✓ Your knowledge base is large (10,000+ documents)

Examples: Documentation search, research assistants, content discovery, semantic Q&A

Use Vectorless RAG When:

✓ You have structured data (databases, APIs) ✓ Questions require exact matching (IDs, codes, names) ✓ You need real-time data (prices, inventory, weather) ✓ Latency is critical (need sub-50ms retrieval) ✓ You want to minimize infrastructure complexity

Examples: Customer support (order status, account info), real-time data queries, database Q&A, transactional systems

Use Hybrid RAG When:

✓ You have both structured and unstructured data ✓ Different query types need different retrieval methods ✓ You need maximum accuracy and flexibility ✓ You have the engineering resources to build and maintain it

Examples: Enterprise chatbots, complex support systems, multi-source knowledge bases, production AI applications

The Practical Reality

Most successful RAG systems start simple and evolve:

Month 1: Basic vector RAG with Chroma and OpenAI embeddings Month 3: Add keyword search for exact matching Month 6: Implement query routing and hybrid retrieval Month 12: Add re-ranking, compression, and advanced techniques

Don’t try to build the perfect system on day one. Build something that works, measure it, improve it.

RAG in 2026: What’s Next?

The RAG landscape is evolving fast. Here’s what’s happening now.

Multimodal RAG

RAG isn’t just for text anymore. Companies are building systems that retrieve images, videos, audio, and code.

Google’s Gemini can search across text, images, and videos. Ask “Show me examples of modern kitchen designs” and it retrieves relevant images, analyzes them, and generates design suggestions.

GitHub Copilot uses RAG to search your codebase and relevant repositories. It retrieves code snippets, not just documentation, and suggests implementations that match your project’s patterns.

Agentic RAG

Instead of a single retrieve-and-generate step, AI agents decide what to retrieve, when to retrieve, and how to combine information from multiple sources.

Anthropic’s Claude with tool use can decide to search the web, query a database, call an API, or use its training data—all in a single conversation. It’s RAG with reasoning about retrieval strategy.

Fine-Tuned Retrieval Models

Generic embedding models are good, but domain-specific models are better. Companies are fine-tuning embedding models on their own data.

Cohere offers fine-tuning for their embedding models. Train on your documents and queries, and retrieval quality improves by 30-40%. The cost? A few hundred dollars and a day of compute time.

RAG + Long Context Windows

GPT-4 Turbo has 128K tokens. Claude 3 has 200K. Gemini 1.5 has 1M tokens. With these massive context windows, do we still need RAG?

Yes, but differently. Instead of retrieving 5 documents, you can retrieve 50. Instead of compressing context, you can include full documents. RAG becomes less about fitting information into limited space and more about finding the right information in the first place.

Key Takeaways

Let’s wrap this up with what actually matters.

RAG solves the fundamental problem of LLM knowledge limitations. It gives AI access to current, accurate, domain-specific information. This transforms LLMs from knowledge snapshots into dynamic research assistants.

Traditional RAG (vector-based) excels at semantic search. Use it for documentation, research, and any scenario where understanding meaning matters more than exact matching. The trade-off is higher latency and cost.

Vectorless RAG excels at structured data and exact matching. Use it for database queries, real-time data, and scenarios where speed and simplicity matter. The trade-off is no semantic understanding.

Hybrid RAG gives you the best of both worlds. Build a query router that picks the right retrieval method for each query type. This is how production systems work, but it requires more engineering effort.

Start simple, iterate based on real usage. Don’t over-engineer on day one. Build basic RAG, test with real users, measure what matters, then optimize. Most teams waste time building sophisticated systems for problems they don’t have yet.

Measure everything. Track retrieval quality, generation accuracy, latency, cost, and user satisfaction. You can’t improve what you don’t measure. Build a test set and run it regularly.

The future of AI applications isn’t just better LLMs—it’s better retrieval. RAG is how you make AI useful for real-world problems. Master it, and you’ll build AI that people actually want to use.

What’s Your RAG Challenge?

I’ve built RAG systems for documentation search, customer support, and internal knowledge bases. Each one taught me something new about what works and what doesn’t.

What are you building? Struggling with retrieval quality? Dealing with latency issues? Trying to decide between vector and vectorless approaches?

Let’s talk. Drop me a message—I’d love to hear about your RAG challenges and share what I’ve learned.

Connect with me:

Email: [your-email]
LinkedIn: [your-linkedin]
Twitter: [your-twitter]

Building AI that actually works is hard. But it’s also incredibly rewarding when you get it right. Let’s figure it out together.

RAG Explained: Traditional vs Vectorless Retrieval-Augmented Generation

RAG Explained: Traditional vs Vectorless Retrieval-Augmented Generation

What Is RAG and Why Does It Matter?

The Problem RAG Solves

Real-World Impact

How Traditional RAG Works

The Basic Flow

A Concrete Example

Building a Traditional RAG System

Step 1: Prepare Your Knowledge Base

Step 2: Generate Embeddings

Step 3: Store in a Vector Database

Step 4: Build the Retrieval Logic

Step 5: Augment and Generate

Real-World RAG Implementations

ChatGPT with Browsing

Perplexity AI: RAG as a Product

Notion AI: Private Knowledge RAG

Stripe Documentation Assistant

The Limitations of Traditional RAG

Problem 1: The Chunking Dilemma

Problem 2: Semantic Search Isn’t Always Right

Problem 3: The Context Window Limit

Problem 4: Retrieval Latency

Problem 5: The Cold Start Problem

Enter Vectorless RAG

What Is Vectorless RAG?

When Vectorless RAG Wins

A Concrete Vectorless RAG Example

Traditional RAG vs Vectorless RAG: The Showdown

Scenario 1: Customer Support

Scenario 2: Documentation Search

Scenario 3: Real-Time Data

Scenario 4: Complex Research

Hybrid RAG: The Best of Both Worlds

The Hybrid Approach

Example: E-commerce Support Bot

Real-World Hybrid Systems

Advanced RAG Techniques

Query Expansion

Hypothetical Document Embeddings (HyDE)

Re-Ranking

Multi-Hop Retrieval

Contextual Compression

Building Your First RAG System

The Minimal Setup

The Production Setup

Common Pitfalls and How to Avoid Them

Evaluating RAG Performance

Retrieval Metrics

Generation Metrics

End-to-End Metrics

Building a Test Set

The Decision Framework: Which RAG Approach to Use?

Use Traditional RAG (Vector-Based) When:

Use Vectorless RAG When:

Use Hybrid RAG When:

The Practical Reality

RAG in 2026: What’s Next?

Multimodal RAG

Agentic RAG

Fine-Tuned Retrieval Models

RAG + Long Context Windows

Key Takeaways

What’s Your RAG Challenge?

Share this article

Comments & Discussion

Related Articles

AI Agents: Complete Guide to Building Intelligent Systems with Tool Calling

System Design Fundamentals: Building Twitter from Scratch