Working draft — Sancto AI is filling in the full version. The outline below is the structure we'll ship next week.
Why naive RAG fails in production
The tutorial version (embed everything, top-k by cosine, stuff into context, prompt) ships great demos and breaks at scale. It fails on three axes: recall (the right chunk isn't in the top-k), precision (the top-k is full of noise), and citation (you can't trace why the LLM said what it said).
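To make the first two axes measurable, here is a minimal sketch of recall@k and precision@k over a hand-labelled query. The chunk IDs and the labelled set are made up for illustration, and the function names are ours, not from any particular library.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant chunk."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


# Hypothetical labelled example: two chunks are known to answer the query.
retrieved_ids = ["c7", "c2", "c9", "c4", "c1"]
relevant_ids = {"c2", "c5"}

recall = recall_at_k(retrieved_ids, relevant_ids, k=5)        # one of two relevant chunks found
precision = precision_at_k(retrieved_ids, relevant_ids, k=5)  # one of five slots is useful
```

Low recall here means the answer never reaches the prompt. Low precision means it arrives buried in noise.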
Pattern 1: Query expansion
Before the embedding lookup, use a cheap LLM call to expand the user's query with related terms, synonyms, and clarifications, then retrieve with the original query plus the variants. This can roughly double recall on technical domains. Costs ~$0.001 per query. Worth it.
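A minimal sketch of the expansion step, assuming the LLM is reachable through a plain `llm(prompt) -> str` callable. The prompt wording, the `expand_query` helper, and the `fake_llm` stub are all hypothetical. Swap the stub for whichever client you actually use.

```python
from typing import Callable, List


def expand_query(query: str, llm: Callable[[str], str], max_variants: int = 3) -> List[str]:
    """Return the original query followed by de-duplicated LLM rewrites."""
    prompt = (
        f"Rewrite the search query below in up to {max_variants} alternative ways, "
        "using synonyms and related technical terms. Return one rewrite per line.\n\n"
        f"Query: {query}"
    )
    raw = llm(prompt)
    # Strip list markers the model may add ("- ", "* ") and surrounding whitespace.
    variants = [line.strip("-* \t") for line in raw.splitlines()]

    seen, expanded = set(), []
    for candidate in [query] + variants:
        key = candidate.lower()
        if candidate and key not in seen:
            seen.add(key)
            expanded.append(candidate)
    # Original query first, then at most max_variants rewrites.
    return expanded[: max_variants + 1]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: the real one returns one rewrite per line)."""
    return "- k8s pod restart loop\n- Kubernetes CrashLoopBackOff\n- k8s pod restart loop"


queries = expand_query("pod keeps restarting", fake_llm)
```

Each entry in `queries` is then embedded and searched separately, and the result lists are merged before the context is built.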
Pattern 2: Hybrid search (BM25 + dense)
Pure embeddings miss exact matches (product codes, names, regulations). Pure keyword search misses semantic equivalents. Run both and merge the two ranked lists with reciprocal rank fusion. This should be the default for any RAG system you're putting in front of a paying customer.
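Reciprocal rank fusion only needs the rank positions from each retriever, not their raw scores, which is why it works across BM25 and cosine similarity. A minimal sketch follows. The document IDs are made up, and `k = 60` is the commonly used smoothing constant rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from the two retrievers.
bm25_hits = ["doc_sku", "doc_a", "doc_b"]   # keyword search nails the exact product code
dense_hits = ["doc_a", "doc_c", "doc_sku"]  # embeddings prefer the semantic match

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that rank well in both lists rise to the top, while a document found by only one retriever still survives into the merged list.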
Pattern 3: Re-ranking
Retrieve 40 candidates, then re-rank down to the best 5 with a cross-encoder (Cohere Rerank, BAAI bge-reranker). Adds 80–200 ms of latency. Dramatically reduces hallucination. The cheapest big win.
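The shape of the re-ranking step, sketched with a pluggable scorer so it runs without any model download. The `rerank` helper and the `overlap_scorer` stub are hypothetical. In a real deployment the scorer would wrap a cross-encoder that takes (query, passage) pairs and returns one relevance score per pair.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Pair]], List[float]],
    top_n: int = 5,
) -> List[str]:
    """Score every (query, candidate) pair jointly and keep the top_n candidates."""
    pairs = [(query, candidate) for candidate in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [candidate for candidate, _ in ranked[:top_n]]


def overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the passage."""
    scores = []
    for query, passage in pairs:
        query_tokens = set(query.lower().split())
        passage_tokens = set(passage.lower().split())
        scores.append(len(query_tokens & passage_tokens) / max(len(query_tokens), 1))
    return scores


candidates = [
    "reset your password from the login page",
    "billing cycles explained",
    "password reset requires email access",
]
top = rerank("password reset email", candidates, overlap_scorer, top_n=2)
```

Because the scorer sees the query and the passage together, it can judge relevance more precisely than the first-stage embedding similarity, at the price of one model pass per candidate.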
Pattern 4: Agentic retrieval
Instead of one retrieval pass, the agent decides what to look up — and can issue follow-up retrievals based on what's missing. More expensive (3–8x token cost), but the answer quality is in a different league for complex queries.
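A minimal sketch of the control loop, assuming two callables: `retrieve(query)` returns chunks, and `next_query(question, evidence)` returns a follow-up query or `None` when the evidence looks sufficient. Both names, the toy corpus, and the `max_steps` cap are illustrative. In practice `next_query` is the LLM call that drives the extra token cost.

```python
from typing import Callable, Dict, List, Optional


def agentic_retrieve(
    question: str,
    retrieve: Callable[[str], List[str]],
    next_query: Callable[[str, List[str]], Optional[str]],
    max_steps: int = 4,
) -> List[str]:
    """Retrieve repeatedly until the agent stops asking or the step budget runs out."""
    evidence: List[str] = []
    query: Optional[str] = question
    steps = 0
    while query is not None and steps < max_steps:
        for chunk in retrieve(query):
            if chunk not in evidence:  # keep first occurrence, skip duplicates
                evidence.append(chunk)
        steps += 1
        query = next_query(question, evidence)
    return evidence


# Toy corpus keyed by exact query, standing in for a real retriever.
CORPUS: Dict[str, List[str]] = {
    "plan limits": ["The Team plan allows 10 seats."],
    "seat pricing": ["Each extra seat costs $8 per month."],
}


def toy_retrieve(query: str) -> List[str]:
    return CORPUS.get(query, [])


def toy_next_query(question: str, evidence: List[str]) -> Optional[str]:
    """Stand-in for the LLM deciding what is still missing."""
    if not any("costs" in chunk for chunk in evidence):
        return "seat pricing"
    return None


evidence = agentic_retrieve("plan limits", toy_retrieve, toy_next_query)
```

The `max_steps` cap matters: without it, an agent that never judges the evidence sufficient keeps spending tokens indefinitely.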
Pattern 5: Structured outputs with citations
Force the LLM to return JSON with explicit citations to the retrieved chunks. This makes the output auditable and lets you build "why did it say that?" UX. The trick most B2B teams skip until their first wrong answer in front of a customer.
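Asking for JSON is only half the job. The response also has to be checked against the chunks that were actually retrieved. A minimal validation sketch follows, assuming a hypothetical schema of `{"claims": [{"text": ..., "citations": [chunk_id, ...]}]}`. The schema and the `parse_cited_answer` helper are ours, not a standard.

```python
import json
from typing import Dict


def parse_cited_answer(raw: str, chunks: Dict[str, str]) -> dict:
    """Parse the model's JSON and reject uncited claims or citations to unknown chunks."""
    data = json.loads(raw)
    claims = data.get("claims")
    if not isinstance(claims, list) or not claims:
        raise ValueError("expected a non-empty 'claims' list")
    for claim in claims:
        cited = claim.get("citations", [])
        if not cited:
            raise ValueError(f"uncited claim: {claim.get('text')!r}")
        unknown = [chunk_id for chunk_id in cited if chunk_id not in chunks]
        if unknown:
            raise ValueError(f"citations to chunks that were never retrieved: {unknown}")
    return data


# Hypothetical retrieved chunks and model output.
retrieved_chunks = {
    "c1": "Refunds are available within 30 days of purchase.",
    "c2": "Annual plans are billed once per year.",
}
model_output = '{"claims": [{"text": "Refunds are possible for 30 days.", "citations": ["c1"]}]}'

answer = parse_cited_answer(model_output, retrieved_chunks)
```

A `ValueError` here is a signal to retry or fall back instead of showing the answer. The validated chunk IDs are what the "why did it say that?" view links to.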
If you're picking one to add today, pick re-ranking. Cheapest, biggest jump, no architectural change.
Full version of this article (with code samples and benchmarks from three of our production deployments) drops next week. Want the early draft? Email us.