Beyond "embed, search, generate"
December 2025
Payment API documentation: 50+ pages
Endpoints, auth flows, error codes, webhooks, code examples...
Developers spend hours searching.
What if they could just ask?
This is RAG — Retrieval Augmented Generation
I tried it. It was mediocre.
Same dataset. Same evaluation.
| Technique | Score |
|---|---|
| Semantic Chunking | 0.20 |
| Simple RAG | 0.30 |
| HyDE | 0.50 |
| Reranker | 0.70 |
| Fusion (Hybrid Search) | 0.83 |
| CRAG | 0.82 |
| Adaptive RAG | 0.86 |
Winners don't rely on one method
They combine and verify
6 stages instead of 3
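Before the stage-by-stage walkthrough, here is a minimal sketch of the whole flow chained as plain function calls, using the helpers introduced in the sections below (hybrid retrieval, reranking, relevance check, query rewrite, generation, grounding check). The LangGraph version at the end wires the same stages as a graph.

def answer(query: str) -> str:
    # 1. Hybrid retrieval: vector + BM25
    docs = hybrid_search(query, k=10)
    # 2. Rerank and keep only the strongest candidates
    docs = rerank_documents(query, docs, top_n=5)
    # 3. Relevance check, 4. one rewrite-and-retry if nothing matches
    if not check_relevance(query, docs):
        query = rewrite_query(query)
        docs = rerank_documents(query, hybrid_search(query, k=10), top_n=5)
        if not check_relevance(query, docs):
            return "I don't have information about this."
    # 5. Generate, 6. verify the answer is grounded in the retrieved docs
    return generate_with_verification(query, docs)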
Vector search + Keyword search (BM25)
"How do I authenticate?"
→ finds "obtaining access tokens"
✓ Semantic meaning
✗ Exact terms
"POST /api/v1/payment"
→ finds exact endpoint
✗ Semantic meaning
✓ Exact terms
def hybrid_search(query, k=10):
    # Run both retrievers: dense vectors for meaning, BM25 for exact terms
    vector_results = vector_store.similarity_search(query, k=k)
    bm25_results = bm25_retriever.get_relevant_documents(query)

    # Reciprocal Rank Fusion: each list contributes 1 / (60 + rank + 1)
    # Assumes each chunk carries a unique "id" in its metadata
    fused_scores, docs_by_id = {}, {}
    for results in [vector_results, bm25_results]:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata["id"]
            docs_by_id[doc_id] = doc
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (60 + rank + 1)

    ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)[:k]
    return [docs_by_id[doc_id] for doc_id, _ in ranked]
10 documents retrieved. Not all are relevant.
Score each one against the query.
def rerank_documents(query, documents, top_n=5):
    # Score every candidate against the query with the LLM, keep the best top_n
    scored = []
    for doc in documents:
        prompt = f"""Rate relevance from 0 to 10. Reply with only a number.
Query: {query}
Document: {doc.page_content[:500]}"""
        score = float(llm.invoke(prompt).content.strip())
        scored.append({"doc": doc, "score": score})
    scored.sort(key=lambda x: x["score"], reverse=True)
    return [item["doc"] for item in scored[:top_n]]
Top 3: relevant (7+)
Bottom 2: noise — filtered out
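If you prefer an absolute cutoff over a fixed top_n, here is a small variant of the reranker. The 7.0 threshold is illustrative, matching the "7+" above, not a value from the source.

def rerank_with_threshold(query, documents, min_score=7.0):
    # Same 0-10 LLM scoring as rerank_documents, but drop anything below the cutoff
    kept = []
    for doc in documents:
        prompt = f"""Rate relevance from 0 to 10. Reply with only a number.
Query: {query}
Document: {doc.page_content[:500]}"""
        if float(llm.invoke(prompt).content.strip()) >= min_score:
            kept.append(doc)
    return kept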
The CRAG pattern
Corrective Retrieval Augmented Generation
The query doesn't match any of the documents well.
Simple RAG still generates a confident, wrong answer.
def check_relevance(query, documents):
    context = "\n".join([doc.page_content[:300] for doc in documents])
    prompt = f"""Can this context answer the query? Reply 'yes' or 'no'.
Query: {query}
Context: {context}"""
    return "yes" in llm.invoke(prompt).content.lower()
def rag_with_retry(query):
    docs = retrieve(query)
    if not check_relevance(query, docs):
        # Retrieval missed: rewrite the query once and try again
        new_query = rewrite_query(query)
        docs = retrieve(new_query)
        if not check_relevance(new_query, docs):
            # Still nothing usable: refuse instead of hallucinating
            return "I don't have information about this."
    return generate_answer(query, docs)
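rewrite_query is used above but not shown. A minimal sketch, assuming a single LLM call that rephrases the question in the documentation's own vocabulary:

def rewrite_query(query):
    # Rephrase the question with more precise, documentation-style terminology
    prompt = f"""Rewrite this question so it better matches technical API documentation.
Use precise terminology and expand abbreviations. Reply with only the rewritten question.
Question: {query}"""
    return llm.invoke(prompt).content.strip()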
LLM generates an answer...
Then verify: is it actually in the docs?
def check_grounding(answer, documents):
    context = "\n".join([doc.page_content for doc in documents])
    prompt = f"""Is this answer fully supported by the context?
Reply only 'yes' or 'no'.
Context: {context}
Answer: {answer}"""
    return "yes" in llm.invoke(prompt).content.lower()
def generate_with_verification(query, documents):
    answer = generate_answer(query, documents)
    if not check_grounding(answer, documents):
        # Regenerate with a stricter "only use the context" instruction
        answer = generate_answer(query, documents, strict=True)
    return answer
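generate_answer is assumed throughout. One possible sketch, where strict=True tightens the instruction to use only the provided context:

def generate_answer(query, documents, strict=False):
    context = "\n\n".join(doc.page_content for doc in documents)
    instructions = (
        "Answer ONLY with facts stated in the context. "
        "If the context does not contain the answer, say so."
        if strict
        else "Answer the question based on the context."
    )
    prompt = f"""{instructions}
Context: {context}
Question: {query}"""
    return llm.invoke(prompt).content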
from langgraph.graph import StateGraph, START, END
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("rerank", rerank_documents)
workflow.add_node("check_relevance", check_relevance)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate_answer)
workflow.add_node("check_grounding", check_grounding)
workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "rerank")
workflow.add_edge("rerank", "check_relevance")
workflow.add_conditional_edges(
    "check_relevance",
    route_by_relevance,
    {"relevant": "generate", "not_relevant": "rewrite"},
)
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", "check_grounding")
workflow.add_edge("check_grounding", END)
pipeline = workflow.compile()
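RAGState and route_by_relevance are referenced but not defined above. A minimal sketch of both, assuming each node function reads and writes this shared state (the field names are an assumption, not from the source):

from typing import List, TypedDict

class RAGState(TypedDict):
    query: str
    documents: List        # retrieved, then reranked, chunks
    relevant: bool         # set by the relevance-check node
    answer: str

def route_by_relevance(state: RAGState) -> str:
    # Chooses the outgoing edge after the relevance check
    return "relevant" if state["relevant"] else "not_relevant"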
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

connection = "postgresql://user:pass@localhost/mydb"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vector_store = PGVector(
    embeddings=embeddings,
    collection_name="docs",
    connection=connection,
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
chunks = splitter.split_documents(documents)
vector_store.add_documents(chunks)
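The keyword half of the hybrid search is not part of this setup. One way to build it from the same chunks, assuming LangChain's in-memory BM25Retriever (requires the rank_bm25 package):

from langchain_community.retrievers import BM25Retriever

# Keyword retriever over the same chunks that were embedded into pgvector
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10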
Key decision: pgvector instead of separate vector DB
One database for everything
document chunks indexed
Query latency: ~3-4 seconds
Reranking is the bottleneck
Every stage is traceable
Enable the Debug toggle to see the pipeline in real time
Simple RAG scores 0.30
Production RAG needs verification at every step
i33ym.cc