Building a Production RAG System

Beyond "embed, search, generate"

December 2025

The Problem

Payment API documentation: 50+ pages

Endpoints, auth flows, error codes, webhooks, code examples...

Developers spend hours searching.
What if they could just ask?

The "Obvious" Solution

Embed docs → Vector search → LLM generates

This is RAG — Retrieval Augmented Generation

I tried it. It was mediocre.
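
In code, the three steps amount to something like this (a sketch; vector_store and llm are placeholders for whichever embedding store and chat model you use):

def simple_rag(query):
    # 1. Embed the query and pull the nearest chunks
    docs = vector_store.similarity_search(query, k=5)

    # 2. Stuff them into a prompt
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"

    # 3. Let the LLM generate
    return llm.invoke(prompt).content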

18 Techniques Benchmarked

Same dataset. Same evaluation.

The Results

Technique                 Score
Semantic Chunking         0.20
Simple RAG                0.30
HyDE                      0.50
Reranker                  0.70
Fusion (Hybrid Search)    0.83
CRAG                      0.82
Adaptive RAG              0.86

The Pattern

Winners don't rely on one method

They combine and verify

The Architecture

Hybrid Search → Rerank → Relevance Check → (Rewrite if needed) → Generate → Grounding Check

6 stages instead of 3

Stage 1: Hybrid Search

Vector search + Keyword search (BM25)

Why Both?

Vector Search

"How do I authenticate?"

→ finds "obtaining access tokens"

Semantic meaning ✓   Exact terms ✗

BM25 Search

"POST /api/v1/payment"

→ finds the exact endpoint

Semantic meaning ✗   Exact terms ✓

Reciprocal Rank Fusion

def hybrid_search(query, k=10):
    vector_results = vector_store.similarity_search(query, k=k)
    bm25_results = bm25_retriever.get_relevant_documents(query)

    # Reciprocal Rank Fusion: each list contributes 1 / (60 + rank),
    # so documents ranked highly in both lists rise to the top
    fused_scores = {}
    for results in [vector_results, bm25_results]:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata["id"]
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            fused_scores[doc_id] += 1 / (60 + rank + 1)

    # Returns (doc_id, score) pairs, highest fused score first
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)[:k]
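
The bm25_retriever used above isn't shown in this deck; one way to build it, assuming LangChain's BM25Retriever over the same chunks that feed the vector store:

from langchain_community.retrievers import BM25Retriever

# Keyword index over the same chunks as the vector store
# (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10  # return the top 10 keyword matches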

Stage 2: Reranking

10 documents retrieved. Not all are relevant.

Score each one against the query.

LLM-Based Relevance Scoring

def rerank_documents(query, documents, top_n=5):
    scored = []

    for doc in documents:
        prompt = f"""Rate relevance from 0 to 10. Reply with only a number.

Query: {query}
Document: {doc.page_content[:500]}"""

        # One LLM call per document: simple, but the latency adds up
        score = float(llm.invoke(prompt).content.strip())
        scored.append({"doc": doc, "score": score})

    scored.sort(key=lambda x: x["score"], reverse=True)
    return [item["doc"] for item in scored[:top_n]]

Real Output

rerank_scores: [9, 8, 7, 3, 2]

Top 3: relevant (7+)

Bottom 2: noise — filtered out

Stage 3: Relevance Check

The CRAG pattern

Corrective Retrieval Augmented Generation

The Problem

Query doesn't match any documents well.

Simple RAG: confidently generates a wrong answer

The Solution

def check_relevance(query, documents):
    context = "\n".join([doc.page_content[:300] for doc in documents])
    
    prompt = f"""Can this context answer the query? Reply 'yes' or 'no'.

Query: {query}
Context: {context}"""
    
    return "yes" in llm.invoke(prompt).content.lower()


def rag_with_retry(query):
    docs = retrieve(query)
    
    if not check_relevance(query, docs):
        new_query = rewrite_query(query)
        docs = retrieve(new_query)
        
        if not check_relevance(new_query, docs):
            return "I don't have information about this."
    
    return generate_answer(query, docs)
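
rewrite_query isn't shown above; a minimal version just asks the LLM to rephrase the query in the documentation's vocabulary:

def rewrite_query(query):
    # Reformulate the query so it better matches the docs' wording
    prompt = f"""Rewrite this query so it matches the wording of technical API
documentation. Reply with only the rewritten query.

Query: {query}"""
    return llm.invoke(prompt).content.strip()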

Stage 4: Grounding Check

LLM generates an answer...

Then verify: is it actually in the docs?

Hallucination Prevention

def check_grounding(answer, documents):
    context = "\n".join([doc.page_content for doc in documents])
    
    prompt = f"""Is this answer fully supported by the context?
Reply only 'yes' or 'no'.

Context: {context}

Answer: {answer}"""
    
    return "yes" in llm.invoke(prompt).content.lower()


def generate_with_verification(query, documents):
    answer = generate_answer(query, documents)
    
    if not check_grounding(answer, documents):
        answer = generate_answer(query, documents, strict=True)
    
    return answer
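
generate_answer and its strict flag aren't shown either; one plausible implementation tightens the prompt on the retry:

def generate_answer(query, documents, strict=False):
    context = "\n\n".join(doc.page_content for doc in documents)

    instruction = "Answer the query using the context."
    if strict:
        # Retry mode: refuse anything the context doesn't state outright
        instruction = ("Answer using ONLY facts stated in the context. "
                       "If the context does not contain the answer, say so.")

    prompt = f"""{instruction}

Context: {context}

Query: {query}"""
    return llm.invoke(prompt).content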

LangGraph Pipeline

from langgraph.graph import StateGraph, START, END

workflow = StateGraph(RAGState)

workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("rerank", rerank_documents)
workflow.add_node("check_relevance", check_relevance)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate_answer)
workflow.add_node("check_grounding", check_grounding)

workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "rerank")
workflow.add_edge("rerank", "check_relevance")

# If the retrieved context can't answer the query, loop back through a rewrite
workflow.add_conditional_edges(
    "check_relevance",
    route_by_relevance,
    {"relevant": "generate", "not_relevant": "rewrite"})

workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", "check_grounding")
workflow.add_edge("check_grounding", END)

pipeline = workflow.compile()
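
RAGState and route_by_relevance are assumed above; here is a sketch of what they could look like. Note that real LangGraph nodes take the state dict and return partial updates, so the per-stage functions shown earlier would be wrapped accordingly.

from typing import List, TypedDict
from langchain_core.documents import Document

class RAGState(TypedDict):
    query: str
    documents: List[Document]
    relevant: bool
    answer: str

def route_by_relevance(state: RAGState) -> str:
    # Conditional edge: returns a key from the mapping passed to add_conditional_edges
    return "relevant" if state["relevant"] else "not_relevant"

# Invoking the compiled graph (query text is illustrative)
result = pipeline.invoke({"query": "How do I refund a payment?"})
print(result["answer"])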
                

Vector Store Setup

from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

connection = "postgresql+psycopg://user:pass@localhost:5432/mydb"  # psycopg (v3) driver
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vector_store = PGVector(
    embeddings=embeddings,
    collection_name="docs",
    connection=connection,
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "])

chunks = splitter.split_documents(documents)
vector_store.add_documents(chunks)
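
A quick sanity check once indexing has run (the query text is illustrative):

results = vector_store.similarity_search("How do I refund a payment?", k=3)
for doc in results:
    print(doc.metadata.get("source"), "→", doc.page_content[:80])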

The Stack

Django · PostgreSQL + pgvector · LangChain · LangGraph · OpenAI

Key decision: pgvector instead of separate vector DB

One database for everything

Results

731 document chunks indexed

Query latency: ~3-4 seconds

Reranking is the bottleneck (one LLM call per document)

Observability

Every stage is traceable

[retrieve]  vector: 5, bm25: 5, fused: 10
[rerank]    scores: [9, 8, 7, 3, 2]
[relevance] passed
[grounding] passed
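
One way to produce those lines with the LangGraph pipeline is to carry a trace list in the state and let every node append to it; the trace field is an addition of this sketch, not something LangGraph provides:

def retrieve_documents(state):
    docs = hybrid_search(state["query"])  # stage 1; assumed to return Documents here
    trace = state.get("trace", []) + [f"[retrieve] fused: {len(docs)}"]
    return {"documents": docs, "trace": trace}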

Live Demo

i33ym.cc/rag

Enable the Debug toggle to see the pipeline in real time

Future Improvements

  • Local cross-encoder — faster reranking without API calls (see the sketch after this list)
  • Embedding cache — incremental document updates
  • User feedback — thumbs up/down to tune retrieval
  • Query classification — route to specialized sub-pipelines
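
For the first item, a sketch of what a local cross-encoder reranker could look like, using sentence-transformers (the model name is one common choice, not something this project ships with yet):

from sentence_transformers import CrossEncoder

# Small local model that scores all (query, passage) pairs in one batch
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_locally(query, documents, top_n=5):
    pairs = [(query, doc.page_content[:500]) for doc in documents]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]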

The Takeaway

Simple RAG scores 0.30

Production RAG needs verification at every step

The benchmark winners share one trait: they don't trust any single component.

Resources

Questions?

i33ym.cc