
Building a RAG

What The Heck Is This?

I built a system that lets you ask questions about technical documentation and get accurate answers. Instead of searching through 500+ pages of docs, you type a question like "How do I authenticate?" and get a direct answer with code examples.

This article explains how I built it, what went wrong at first, and what I learned along the way. If you've heard terms like "RAG" or "vector search" but aren't sure what they mean, this is for you.

The Problem I Was Trying to Solve

Technical documentation is hard to navigate. Even well-written docs become overwhelming when there are dozens of pages covering endpoints, authentication, error codes, webhooks, and code examples in multiple languages.

Developers waste hours searching for specific information. They know the answer is somewhere in the docs, but finding it takes longer than it should.

I wanted to build something simple: a chat interface where you ask a question and get an accurate answer pulled directly from the documentation.

The Obvious Solution (That Didn't Work Well)

When I first heard about AI-powered search, the approach seemed straightforward:

Step 1: Take all your documents and convert them into "embeddings" — basically, turn text into numbers that capture meaning.

Step 2: When someone asks a question, convert that question into numbers too.

Step 3: Find the documents whose numbers are most similar to the question's numbers.

Step 4: Give those documents to an AI model like GPT and ask it to answer the question.
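
In code, those four steps look roughly like the sketch below. It uses the OpenAI Python SDK and plain cosine similarity over a tiny in-memory list of documents; the model names and the two example documents are placeholders for illustration, not my actual setup.

```python
# Naive RAG sketch: embed documents, embed the question, pick the most
# similar chunks by cosine similarity, and let the model answer from them.
# Model names and the in-memory corpus are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

docs = [
    "To authenticate, POST your API key to /api/v1/auth and store the returned token.",
    "Webhooks let you receive events. Register a webhook URL in your dashboard settings.",
]

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

doc_vectors = embed(docs)                          # Step 1: documents -> numbers

def naive_rag(question, top_k=2):
    q = embed([question])[0]                       # Step 2: question -> numbers
    sims = doc_vectors @ q / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q)
    )                                              # Step 3: most similar documents
    context = "\n\n".join(docs[i] for i in np.argsort(sims)[::-1][:top_k])
    resp = client.chat.completions.create(         # Step 4: generate an answer
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```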

This approach has a name: RAG, which stands for Retrieval Augmented Generation. "Retrieval" because you're retrieving relevant documents. "Augmented" because you're augmenting (adding to) what the AI knows. "Generation" because the AI generates an answer.

I built this. It worked, but not well. The answers were often wrong or incomplete. Sometimes it would confidently give incorrect information. Other times it would miss obvious answers that were clearly in the docs.

Why Simple RAG Fails

I found a research paper that benchmarked multiple RAG techniques on the same dataset. The results were eye-opening:

Simple RAG (what I described above) scored 0.30 out of 1.0. That means it only got about 30% of answers right.

Even worse, "semantic chunking" — a technique that many tutorials recommend — scored only 0.20. That's worse than the basic approach.

But some techniques scored much higher:

Fusion (combining multiple search methods): 0.83
CRAG (checking if results are relevant): 0.82
Adaptive RAG (adjusting strategy based on the question): 0.86

The difference between 0.30 and 0.86 is huge. It's the difference between a system that frustrates users and one that actually helps them.

What the Winners Do Differently

Looking at the top-performing techniques, I noticed a pattern: they don't trust any single method. They combine multiple approaches and verify results at every step.

Think of it like a careful researcher versus a lazy one. The lazy researcher does one Google search and uses the first result. The careful researcher searches multiple sources, cross-references them, checks if they actually answer the question, and only then writes a conclusion.

I rebuilt my system to work like the careful researcher.

The Architecture I Built

My improved system has six stages instead of the original three:

Stage 1: Hybrid Search — Search two different ways and combine results
Stage 2: Reranking — Score each result for relevance
Stage 3: Relevance Check — Verify we actually found useful information
Stage 4: Query Rewrite — If results are poor, rephrase and try again
Stage 5: Generate Answer — Create the response
Stage 6: Grounding Check — Verify the answer is supported by the documents

Let me explain each stage in plain English.

Stage 1: Hybrid Search

The basic RAG approach uses "vector search" — finding documents with similar meaning. This works well for conceptual questions. If you ask "How do I log in?", it will find documents about "authentication" even though those words are different.

But vector search fails badly for exact matches. If you search for "POST /api/v1/users", vector search might return documents about "creating accounts" because the meaning is similar. But you wanted the exact endpoint path, not a conceptual match.

There's another type of search called "keyword search" (specifically, an algorithm called BM25). This finds documents containing the exact words you typed. Great for technical terms and code, but terrible for conceptual questions.

My solution: run both searches and combine the results. Documents that appear in both searches rise to the top. This is called "hybrid search" and it works remarkably well.

A document about authentication that also contains the exact endpoint path you searched for will rank higher than a document that only matches one way.
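
There are several ways to do the combining. Reciprocal rank fusion (RRF) is one common recipe and captures the idea; the sketch below isn't tied to any particular search backend and simply merges two ranked lists of document IDs, however they were produced.

```python
# Reciprocal rank fusion: one common way to merge vector-search results with
# keyword-search (BM25) results. Documents that rank well in both lists end
# up with the highest combined score. Shown purely as an illustration.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """ranked_lists: e.g. [vector_hits, keyword_hits], best match first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = [17, 4, 92, 8]    # IDs from the embedding similarity search
keyword_hits = [3, 17, 8, 55]   # IDs from the BM25 / keyword search
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# -> [17, 8, 3, 4, 92, 55]  (17 ranks high in both lists, so it wins)
```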

Stage 2: Reranking

After hybrid search, I have about 10 documents that might be relevant. The problem is that search results are ranked by similarity, not by whether they actually answer the question.

A document might be very similar to your question but still not contain the answer. Another document might seem less similar but have exactly what you need.

Reranking fixes this. I take each document and ask: "On a scale of 0-10, how well does this document answer the original question?"

In my system, I use an AI model to do this scoring. For each of the 10 documents, I ask the model to rate its relevance. Then I sort by those scores and keep only the top results.

Real example from my system:

Rerank scores: [9, 8, 7, 3, 2]

See the gap? Three documents scored 7 or higher — those are clearly relevant. Two documents scored 3 or lower — those are noise that would have confused the final answer.

Without reranking, all five documents would go to the AI, and the irrelevant ones would dilute the good information.
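
Here is roughly what that scoring step looks like. The prompt wording, the 0-10 parsing, and the score cut-off are illustrative; the shape is what matters: one model call per candidate, then sort and keep the best.

```python
# LLM-as-reranker sketch: ask the model to score each candidate 0-10,
# then keep only the highest-scoring documents. Prompt and cut-off are
# illustrative, not my exact values.
from openai import OpenAI

client = OpenAI()

def rerank(question, documents, keep=3):
    scored = []
    for doc in documents:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    "On a scale of 0-10, how well does this document answer "
                    f"the question?\n\nQuestion: {question}\n\nDocument:\n{doc}\n\n"
                    "Reply with a single integer only."
                ),
            }],
        )
        try:
            score = int(resp.choices[0].message.content.strip())
        except ValueError:
            score = 0  # an unparsable reply counts as irrelevant
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:keep] if score >= 7]  # drop the noise
```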

Stage 3: Relevance Check

Sometimes no documents are relevant. Maybe the user asked about something not covered in the documentation. Maybe they phrased their question in an unusual way.

Simple RAG doesn't handle this well. If you ask about something not in the docs, it will still find the "most similar" documents (even if they're not very similar) and generate an answer based on them. The result is a confident-sounding but completely wrong answer.

My system includes a relevance check. After reranking, I ask: "Can these documents actually answer the user's question?"

If the answer is no, I don't proceed to generate a response. Instead, I try a different approach.
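
In practice this check is one more small yes/no model call. The prompt below is an illustrative version of the idea, not my exact wording.

```python
# Relevance-check sketch: a yes/no gate before generation.
from openai import OpenAI

client = OpenAI()

def documents_can_answer(question, documents) -> bool:
    context = "\n\n---\n\n".join(documents)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Can the documents below answer the question? "
                "Reply with exactly YES or NO.\n\n"
                f"Question: {question}\n\nDocuments:\n{context}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```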

Stage 4: Query Rewrite

When the relevance check fails, maybe the problem isn't the documents — maybe it's how the question was phrased.

Users don't always use the same terminology as the documentation. They might ask "How do I log in?" when the docs only talk about "authentication tokens". They might ask "What's the price?" when the docs say "pricing" or "cost" or "fees".

Query rewriting takes the original question and rephrases it in different ways. "How do I log in?" might become "How do I authenticate?" or "How do I get an access token?"

Then I run the search again with this new phrasing. Often, this finds the relevant documents that the original search missed.

If the second search still fails the relevance check, I give up and tell the user I couldn't find the information. An honest "I don't know" is better than a wrong answer.
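
Here is a sketch of the rewrite step, plus the retry-once control flow around it. The hybrid_search helper and the rerank and documents_can_answer functions are the hypothetical ones from the earlier sketches.

```python
# Query-rewrite sketch: rephrase the question in documentation terminology
# and retry the search once. After one failed retry, answer honestly.
from openai import OpenAI

client = OpenAI()

def rewrite_query(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rewrite this question using the terminology technical API "
                "documentation would use. Reply with the rewritten question only.\n\n"
                f"{question}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip()

# Hypothetical control flow around it (helpers defined in earlier sketches):
# docs = rerank(question, hybrid_search(question))
# if not documents_can_answer(question, docs):
#     retry = rewrite_query(question)   # "How do I log in?" -> "How do I authenticate?"
#     docs = rerank(retry, hybrid_search(retry))
#     if not documents_can_answer(retry, docs):
#         return "I couldn't find this in the documentation."
```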

Stage 5: Generate Answer

Only now, after all that verification, do I actually generate an answer.

I take the top-ranked documents and send them to the AI model along with the user's question. The prompt explicitly tells the model to only use information from the provided documents and to admit if it can't find the answer.

This is important: the AI is not searching its general knowledge. It's only looking at the specific documents I retrieved. This prevents it from making up information that sounds plausible but isn't in the actual documentation.
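
The generation prompt is the simplest piece. It looks something like this; the wording is illustrative.

```python
# Generation sketch: the model sees only the retrieved chunks, and the prompt
# tells it to refuse rather than guess.
from openai import OpenAI

client = OpenAI()

def generate_answer(question, documents):
    context = "\n\n---\n\n".join(documents)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using ONLY the documentation excerpts provided. "
                    "If the excerpts don't contain the answer, say you don't know. "
                    "Include code examples from the excerpts when relevant."
                ),
            },
            {"role": "user", "content": f"Documentation:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```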

Stage 6: Grounding Check

Even with all these precautions, AI models sometimes "hallucinate" — they generate information that seems reasonable but isn't actually in the source documents.

The grounding check is my final safety net. After generating an answer, I ask: "Is every claim in this answer supported by the provided documents?"

If something in the answer can't be traced back to the documents, I flag it. In some cases, I regenerate the answer with stricter instructions.

This catches subtle hallucinations that would otherwise slip through.
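
Like the relevance check, the grounding check is a small yes/no call that compares the draft answer against the retrieved chunks. An illustrative version:

```python
# Grounding-check sketch: is every claim in the draft answer supported by
# the retrieved documents? Prompt wording is illustrative.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer, documents) -> bool:
    context = "\n\n---\n\n".join(documents)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Is every factual claim in the answer below supported by the "
                "documents? Reply with exactly YES or NO.\n\n"
                f"Documents:\n{context}\n\nAnswer:\n{answer}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# If this returns False, regenerate with stricter instructions or flag the answer.
```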

The Technology Stack

For those curious about the technical details, here's what I used:

Django — A Python web framework. I already had a website built with Django, so I added the RAG system to it.

PostgreSQL with pgvector — PostgreSQL is a database. pgvector is an extension that lets it store and search vector embeddings. This means I didn't need a separate "vector database" — my regular database handles everything.

LangChain — A Python library that makes it easier to work with AI models and build pipelines like this.

LangGraph — An extension of LangChain that lets you build complex workflows with conditional logic (like "if relevance check fails, rewrite the query"). There's a small wiring sketch after this list.

OpenAI — I use their embedding model to convert text to vectors, and their GPT-4o-mini model for generating answers and doing the various checks.
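
To make the conditional flow concrete, here is a rough LangGraph wiring of the six stages. The node functions are stubs standing in for the logic sketched earlier; the state fields, names, and retry limit are illustrative, but the graph structure mirrors the pipeline described above.

```python
# LangGraph wiring sketch for the six-stage pipeline. Node bodies are stubs;
# only the graph structure and the conditional edges are the point here.
from typing import List, TypedDict
from langgraph.graph import StateGraph, END

class RAGState(TypedDict):
    question: str
    documents: List[str]
    relevant: bool
    rewrites: int
    answer: str

def search_node(state):    return {"documents": []}                 # Stage 1: hybrid search (stub)
def rerank_node(state):    return {"documents": state["documents"]} # Stage 2: reranking (stub)
def relevance_node(state): return {"relevant": False}               # Stage 3: relevance check (stub)
def rewrite_node(state):   return {"rewrites": state["rewrites"] + 1}  # Stage 4: query rewrite (stub)
def generate_node(state):  return {"answer": ""}                    # Stage 5: generate answer (stub)
def grounding_node(state): return {"answer": state["answer"]}       # Stage 6: grounding check (stub)

def after_relevance(state):
    if state["relevant"]:
        return "generate"
    return "rewrite" if state["rewrites"] == 0 else "give_up"  # retry once, then stop

graph = StateGraph(RAGState)
for name, fn in [("search", search_node), ("rerank", rerank_node),
                 ("relevance", relevance_node), ("rewrite", rewrite_node),
                 ("generate", generate_node), ("grounding", grounding_node)]:
    graph.add_node(name, fn)

graph.set_entry_point("search")
graph.add_edge("search", "rerank")
graph.add_edge("rerank", "relevance")
graph.add_conditional_edges("relevance", after_relevance,
                            {"generate": "generate", "rewrite": "rewrite", "give_up": END})
graph.add_edge("rewrite", "search")      # retry the search with the rephrased question
graph.add_edge("generate", "grounding")
graph.add_edge("grounding", END)

app = graph.compile()
# app.invoke({"question": "How do I log in?", "documents": [],
#             "relevant": False, "rewrites": 0, "answer": ""})
```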

How I Split Documents Into Chunks

One detail I glossed over: you can't just feed entire documents to the AI. They're too long, and most of the content won't be relevant to any given question.

Instead, you split documents into smaller "chunks" — maybe a few paragraphs each. Then you search for relevant chunks, not relevant documents.

But how you split matters a lot. If you split in the middle of a code example, that example becomes useless. If you split a step-by-step guide between steps 3 and 4, neither chunk makes sense on its own.

I use a "recursive" splitting strategy that tries to break at natural boundaries:

First, try to split at major headings (## in markdown).
If chunks are still too big, split at subheadings (###).
Then try paragraph breaks.
Then line breaks.
Finally, if nothing else works, split at spaces between words.

This keeps related content together as much as possible.

I also use "overlap" — each chunk includes a bit of text from the previous chunk. This helps when the answer spans a chunk boundary.
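
LangChain ships a splitter that does this kind of prioritized, recursive splitting out of the box, which is probably the easiest way to get the behavior described above. The separator order in the sketch mirrors the priority list; the chunk size, overlap, and file path are illustrative.

```python
# Recursive splitting sketch with LangChain's RecursiveCharacterTextSplitter.
# Separators are tried in order: headings, subheadings, paragraphs, lines, spaces.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    chunk_size=1000,      # characters per chunk (illustrative)
    chunk_overlap=150,    # overlap so answers spanning a boundary survive
)

with open("docs/authentication.md") as f:   # hypothetical docs file
    chunks = splitter.split_text(f.read())
```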

Results

After implementing all of this, my documentation assistant actually works well.

I indexed 731 chunks from the documentation. When you ask a question, you get an answer in about 3-4 seconds. That's slower than I'd like (reranking 10 documents with an AI model takes time), but it's acceptable.

More importantly, the answers are accurate. When the system knows something, it gives a correct answer with relevant code examples. When it doesn't know something, it says so instead of making things up.

I also built a debug panel that shows what's happening behind the scenes. You can see which documents were retrieved, how they scored, whether the relevance check passed, and whether the grounding check found any issues. This makes it easy to diagnose problems and understand why a particular answer was generated.

What I Would Do Differently

If I built this again, I'd change a few things:

Use a specialized reranking model. Right now, I use GPT for reranking, which requires 10 API calls per question. There are smaller, specialized models designed just for reranking that run locally and would be much faster and cheaper.

Cache more aggressively. Every time I update the documentation, I regenerate all the embeddings. A smarter system would only regenerate embeddings for documents that changed.

Collect user feedback. If users could click "thumbs up" or "thumbs down" on answers, I could use that data to improve the system over time.

Classify questions first. Different types of questions might benefit from different strategies. A conceptual question ("What is X?") and a lookup question ("What's the endpoint for Y?") could be routed to different pipelines optimized for each type.

Try It Yourself

If you want to see this in action, visit i33ym.cc/rag/

Enable the Debug toggle to see all the stages working in real-time. Ask a question and watch the rerank scores, relevance checks, and grounding verification as they happen.

The Bigger Lesson

RAG is not a solved problem. The naive approach — embed documents, search for similar ones, generate an answer — gets about 30% of answers right. That's not good enough for anything serious.

The systems that work well share a common principle: they don't trust any single component. They use multiple search methods, verify relevance, check that answers are grounded in sources, and have fallback strategies when things go wrong.

This applies beyond RAG. Any time you're building a system that needs to be reliable, the pattern is the same: don't rely on one method, verify at every step, and fail gracefully when verification fails.

The benchmark winners don't have some magical algorithm. They're just more careful. They check their work. That's the pattern worth copying.
