Practical lessons from shipping retrieval-augmented generation

What actually matters when you take RAG from a prototype to something you can put in front of users — chunking, evaluation, and the failure modes that only show up in production.

Sandeep Koppula | Apr 22, 2026 | 8 min read

Retrieval-augmented generation looks deceptively simple in a demo: embed some documents, drop the top matches into a prompt, and let the model answer. The gap between that demo and something you are comfortable putting in front of real users is where most of the engineering lives. These are the lessons that have held up across our engagements.

Most quality problems are retrieval problems

When an answer is wrong, the instinct is to blame the model or rewrite the prompt. In practice, the model usually answered faithfully from bad context. Before you touch the prompt, look at what was retrieved — more often than not, the right passage was never in the window.

Chunk on structure (headings, sections, table rows), not a fixed character count.
Combine semantic search with keyword search such as BM25; hybrid retrieval catches exact terms (product codes, error strings, identifiers) that embeddings miss.
Attach metadata (source, date, access level) and filter on it before ranking.
Re-rank the top candidates; the first-pass vector hit is rarely the best one.
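
As a rough illustration of the filter-then-fuse idea in the list above, here is a minimal sketch in Python. It assumes each chunk is a dict with a hypothetical "access_level" field, and it merges a vector ranking with a keyword ranking using reciprocal rank fusion, which is one common way to combine them and not necessarily the method used in any particular stack. The function names and the k=60 default are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Set


def filter_by_metadata(chunks: List[dict], allowed_levels: Set[str]) -> List[dict]:
    """Drop chunks the caller may not see, before any ranking happens.

    Assumes every chunk dict carries an "access_level" key (hypothetical schema).
    """
    return [c for c in chunks if c["access_level"] in allowed_levels]


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Higher fused score means better.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


if __name__ == "__main__":
    vector_ranking = ["a", "b", "c"]   # ids ordered by embedding similarity
    keyword_ranking = ["c", "a", "d"]  # ids ordered by a keyword/BM25 score
    print(reciprocal_rank_fusion([vector_ranking, keyword_ranking]))
```

The fused list would then go to a re-ranker, which only needs to look at the top handful of candidates.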

Build an evaluation set before you optimize

You cannot improve what you cannot measure. A small golden set of 50–100 real questions with known-good answers will tell you more than any amount of eyeballing. Score retrieval (did we fetch the right passage?) separately from generation (did we answer well from it?) so you know which half to fix.
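
One way to keep the two scores apart is sketched below. It assumes a golden set stored as dicts with hypothetical keys "question", "expected_chunk_id", and "expected_answer", and a caller-supplied is_correct function for judging answers (exact match, a rubric, or a human label). None of these names come from a specific library.

```python
from typing import Callable, Dict, List, Tuple


def retrieval_hit_rate(golden: List[dict], retrieved: Dict[str, List[str]], k: int = 5) -> float:
    """Fraction of golden questions whose expected chunk id is in the top k results."""
    if not golden:
        return 0.0
    hits = 0
    for item in golden:
        top_k = retrieved.get(item["question"], [])[:k]
        if item["expected_chunk_id"] in top_k:
            hits += 1
    return hits / len(golden)


def split_failures(
    golden: List[dict],
    retrieved: Dict[str, List[str]],
    answers: Dict[str, str],
    is_correct: Callable[[str, str], bool],
    k: int = 5,
) -> Tuple[List[str], List[str]]:
    """Sort failing questions into retrieval misses and generation misses.

    A question counts as a generation miss only if the right chunk was
    retrieved and the answer was still judged wrong.
    """
    retrieval_misses: List[str] = []
    generation_misses: List[str] = []
    for item in golden:
        question = item["question"]
        if item["expected_chunk_id"] not in retrieved.get(question, [])[:k]:
            retrieval_misses.append(question)
        elif not is_correct(answers.get(question, ""), item["expected_answer"]):
            generation_misses.append(question)
    return retrieval_misses, generation_misses
```

Reading the two lists side by side shows which half of the pipeline to fix first.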

Plan for the failure modes

Production traffic finds the edges fast. The failures that hurt are rarely the model being "dumb" — they are systemic and predictable, which means you can design for them up front.

Stale index: schedule re-indexing and surface the freshness of sources in the answer.
Hallucinated citations: ground every claim in a retrieved chunk and show the source.
Context overflow: budget tokens explicitly and truncate by relevance, not by accident.
Prompt injection from documents: treat retrieved text as untrusted input, not instructions.
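
To make the context-overflow point concrete, here is a minimal sketch of packing chunks into a fixed token budget in relevance order. The approx_tokens helper is a crude word-count stand-in and an assumption of this example; a real system would count with the model's own tokenizer. The function names are hypothetical.

```python
from typing import Callable, List, Tuple


def approx_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_context(
    scored_chunks: List[Tuple[float, str]],
    budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[str]:
    """Keep the most relevant chunks that fit within the token budget.

    scored_chunks holds (relevance_score, text) pairs. Chunks are considered
    from most to least relevant; one that does not fit is skipped, so a
    smaller, lower-ranked chunk can still use the remaining space.
    """
    selected: List[str] = []
    used = 0
    for _score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = count_tokens(text)
        if used + cost > budget:
            continue
        selected.append(text)
        used += cost
    return selected


if __name__ == "__main__":
    chunks = [(0.9, "a b c"), (0.5, "d e f g"), (0.7, "h i")]
    print(pack_context(chunks, budget=5))
```

Whatever gets dropped here is dropped deliberately and in relevance order, and the skipped chunks can be logged for later inspection.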

None of this is exotic. It is the unglamorous discipline of treating an AI feature like any other production system — with tests, observability, and a clear-eyed view of how it breaks.
