Beyond "embed, search, generate"
December 2025
Payment API documentation: 50+ pages
Endpoints, auth flows, error codes, webhooks, code examples...
Developers spend hours searching.
What if they could just ask?
This is RAG — Retrieval Augmented Generation
I tried it. It was mediocre.
Same dataset. Same evaluation.
| Technique | Score |
|---|---|
| Semantic Chunking | 0.20 |
| Simple RAG | 0.30 |
| HyDE | 0.50 |
| Reranker | 0.70 |
| Fusion (Hybrid Search) | 0.83 |
| CRAG | 0.82 |
| Adaptive RAG | 0.86 |
Winners don't rely on one method
They combine and verify
6 stages instead of 3
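Before the stage-by-stage walkthrough, here is a minimal sketch of the whole flow chained as plain function calls, using the helpers introduced in the sections below (hybrid retrieval, reranking, relevance check, query rewrite, generation, grounding check). The LangGraph version at the end wires the same stages as a graph.

def answer(query: str) -> str:
    # 1. Hybrid retrieval: vector + BM25
    docs = hybrid_search(query, k=10)
    # 2. Rerank and keep only the strongest candidates
    docs = rerank_documents(query, docs, top_n=5)
    # 3. Relevance check, 4. one rewrite-and-retry if nothing matches
    if not check_relevance(query, docs):
        query = rewrite_query(query)
        docs = rerank_documents(query, hybrid_search(query, k=10), top_n=5)
        if not check_relevance(query, docs):
            return "I don't have information about this."
    # 5. Generate, 6. verify the answer is grounded in the retrieved docs
    return generate_with_verification(query, docs)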
Vector search + Keyword search (BM25)
"How do I authenticate?"
→ finds "obtaining access tokens"
✓ Semantic meaning
✗ Exact terms
"POST /api/v1/payment"
→ finds exact endpoint
✗ Semantic meaning
✓ Exact terms
def hybrid_search(query, k=10):
    # Run both retrievers: dense vectors for meaning, BM25 for exact terms
    vector_results = vector_store.similarity_search(query, k=k)
    bm25_results = bm25_retriever.get_relevant_documents(query)

    # Reciprocal Rank Fusion: each list contributes 1 / (60 + rank + 1)
    # Assumes each chunk carries a unique "id" in its metadata
    fused_scores, docs_by_id = {}, {}
    for results in [vector_results, bm25_results]:
        for rank, doc in enumerate(results):
            doc_id = doc.metadata["id"]
            docs_by_id[doc_id] = doc
            fused_scores[doc_id] = fused_scores.get(doc_id, 0) + 1 / (60 + rank + 1)

    ranked = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)[:k]
    return [docs_by_id[doc_id] for doc_id, _ in ranked]
10 documents retrieved. Not all are relevant.
Score each one against the query.
def rerank_documents(query, documents, top_n=5):
    # Score every candidate against the query with the LLM, keep the best top_n
    scored = []
    for doc in documents:
        prompt = f"""Rate relevance from 0 to 10. Reply with only a number.
Query: {query}
Document: {doc.page_content[:500]}"""
        score = float(llm.invoke(prompt).content.strip())
        scored.append({"doc": doc, "score": score})
    scored.sort(key=lambda x: x["score"], reverse=True)
    return [item["doc"] for item in scored[:top_n]]
Top 3: relevant (7+)
Bottom 2: noise — filtered out
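If you prefer an absolute cutoff over a fixed top_n, here is a small variant of the reranker. The 7.0 threshold is illustrative, matching the "7+" above, not a value from the source.

def rerank_with_threshold(query, documents, min_score=7.0):
    # Same 0-10 LLM scoring as rerank_documents, but drop anything below the cutoff
    kept = []
    for doc in documents:
        prompt = f"""Rate relevance from 0 to 10. Reply with only a number.
Query: {query}
Document: {doc.page_content[:500]}"""
        if float(llm.invoke(prompt).content.strip()) >= min_score:
            kept.append(doc)
    return kept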
The CRAG pattern
Corrective Retrieval Augmented Generation
The query doesn't match any of the documents well.
Simple RAG still generates a confident, wrong answer.
def check_relevance(query, documents):
    context = "\n".join([doc.page_content[:300] for doc in documents])
    prompt = f"""Can this context answer the query? Reply 'yes' or 'no'.
Query: {query}
Context: {context}"""
    return "yes" in llm.invoke(prompt).content.lower()
def rag_with_retry(query):
    docs = retrieve(query)
    if not check_relevance(query, docs):
        # Retrieval missed: rewrite the query once and try again
        new_query = rewrite_query(query)
        docs = retrieve(new_query)
        if not check_relevance(new_query, docs):
            # Still nothing usable: refuse instead of hallucinating
            return "I don't have information about this."
    return generate_answer(query, docs)
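rewrite_query is used above but not shown. A minimal sketch, assuming a single LLM call that rephrases the question in the documentation's own vocabulary:

def rewrite_query(query):
    # Rephrase the question with more precise, documentation-style terminology
    prompt = f"""Rewrite this question so it better matches technical API documentation.
Use precise terminology and expand abbreviations. Reply with only the rewritten question.
Question: {query}"""
    return llm.invoke(prompt).content.strip()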
LLM generates an answer...
Then verify: is it actually in the docs?
def check_grounding(answer, documents):
    context = "\n".join([doc.page_content for doc in documents])
    prompt = f"""Is this answer fully supported by the context?
Reply only 'yes' or 'no'.
Context: {context}
Answer: {answer}"""
    return "yes" in llm.invoke(prompt).content.lower()
def generate_with_verification(query, documents):
    answer = generate_answer(query, documents)
    if not check_grounding(answer, documents):
        # Regenerate with a stricter "only use the context" instruction
        answer = generate_answer(query, documents, strict=True)
    return answer
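generate_answer is assumed throughout. One possible sketch, where strict=True tightens the instruction to use only the provided context:

def generate_answer(query, documents, strict=False):
    context = "\n\n".join(doc.page_content for doc in documents)
    instructions = (
        "Answer ONLY with facts stated in the context. "
        "If the context does not contain the answer, say so."
        if strict
        else "Answer the question based on the context."
    )
    prompt = f"""{instructions}
Context: {context}
Question: {query}"""
    return llm.invoke(prompt).content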
from langgraph.graph import StateGraph, START, END
workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("rerank", rerank_documents)
workflow.add_node("check_relevance", check_relevance)
workflow.add_node("rewrite", rewrite_query)
workflow.add_node("generate", generate_answer)
workflow.add_node("check_grounding", check_grounding)
workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "rerank")
workflow.add_edge("rerank", "check_relevance")
workflow.add_conditional_edges(
    "check_relevance",
    route_by_relevance,
    {"relevant": "generate", "not_relevant": "rewrite"},
)
workflow.add_edge("rewrite", "retrieve")
workflow.add_edge("generate", "check_grounding")
workflow.add_edge("check_grounding", END)
pipeline = workflow.compile()
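RAGState and route_by_relevance are referenced but not defined above. A minimal sketch of both, assuming each node function reads and writes this shared state (the field names are an assumption, not from the source):

from typing import List, TypedDict

class RAGState(TypedDict):
    query: str
    documents: List        # retrieved, then reranked, chunks
    relevant: bool         # set by the relevance-check node
    answer: str

def route_by_relevance(state: RAGState) -> str:
    # Chooses the outgoing edge after the relevance check
    return "relevant" if state["relevant"] else "not_relevant"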
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

connection = "postgresql://user:pass@localhost/mydb"
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vector_store = PGVector(
    embeddings=embeddings,
    collection_name="docs",
    connection=connection,
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
chunks = splitter.split_documents(documents)
vector_store.add_documents(chunks)
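The keyword half of the hybrid search is not part of this setup. One way to build it from the same chunks, assuming LangChain's in-memory BM25Retriever (requires the rank_bm25 package):

from langchain_community.retrievers import BM25Retriever

# Keyword retriever over the same chunks that were embedded into pgvector
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 10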
Key decision: pgvector instead of separate vector DB
One database for everything
document chunks indexed
Query latency: ~3-4 seconds
Reranking is the bottleneck
Every stage is traceable
Enable the Debug toggle to see the pipeline in real time
Simple RAG scores 0.30
Production RAG needs verification at every step
i33ym.cc