RAG demos are easy. Reliable RAG systems are not. The difference lives in the unglamorous layers: data quality, retrieval evaluation, observability, latency budgets and explicit failure handling.

The production loop

flowchart LR
    A[Source data] --> B[Parse and clean]
    B --> C[Chunk and enrich]
    C --> D[(Vector index)]
    Q[User query] --> R[Retrieve and rerank]
    D --> R
    R --> G[Grounded generation]
    G --> E[Evaluation and traces]
    E -. feedback .-> B

The architecture is a loop, not a pipeline. Every answer creates evidence that should improve ingestion, retrieval and evaluation.

Retrieval before generation

For a query $q$ and candidate chunk $d$, a hybrid score can combine semantic and lexical signals:

$$S(q,d) = \alpha \, S_{\mathrm{dense}}(q,d) + (1-\alpha) \, S_{\mathrm{BM25}}(q,d)$$

The exact value of $\alpha$ matters less than measuring performance on a representative evaluation set.
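As a rough illustration of the weighted sum above, here is a minimal Java sketch. It rests on assumptions the text does not state: that α lies in [0, 1], and that raw dense and BM25 scores sit on different scales and are min-max normalized per query before blending. The class and method names (`HybridScorer`, `minMaxNormalize`, `hybridScores`) are hypothetical.

```java
import java.util.Arrays;

final class HybridScorer {

    // Min-max normalize scores into [0, 1] so dense and BM25 values are comparable.
    // Assumption of this sketch: if all scores are equal, every entry maps to 0.0.
    static double[] minMaxNormalize(double[] scores) {
        double min = Arrays.stream(scores).min().orElse(0.0);
        double max = Arrays.stream(scores).max().orElse(0.0);
        double range = max - min;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = range == 0.0 ? 0.0 : (scores[i] - min) / range;
        }
        return out;
    }

    // S(q,d) = alpha * S_dense(q,d) + (1 - alpha) * S_BM25(q,d), computed per candidate.
    // dense[i] and bm25[i] must refer to the same candidate chunk.
    static double[] hybridScores(double[] dense, double[] bm25, double alpha) {
        if (dense.length != bm25.length) {
            throw new IllegalArgumentException("score arrays must have the same length");
        }
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        double[] d = minMaxNormalize(dense);
        double[] b = minMaxNormalize(bm25);
        double[] out = new double[d.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = alpha * d[i] + (1.0 - alpha) * b[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up scores for three candidates, for illustration only.
        double[] dense = {0.82, 0.75, 0.40};
        double[] bm25 = {3.1, 12.4, 7.0};
        System.out.println(Arrays.toString(hybridScores(dense, bm25, 0.5)));
    }
}
```

Min-max normalization is only one option. Whichever scheme is used, it is worth sweeping α against the evaluation set rather than fixing it by intuition.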

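import java.util.Map;

// One retrieved chunk together with its relevance score and source metadata.
public record RetrievalResult(
    String documentId,
    String content,
    double score,
    Map<String, String> metadata
) {}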
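To show how a record like this might be consumed, the sketch below filters results by a metadata entry and keeps the top k by score. The `TopKSelector` class, its `topK` method and the `"lang"` metadata key in the usage note are hypothetical. The record is redeclared (package-private) only so the snippet compiles on its own.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Redeclared here so the example is self-contained.
record RetrievalResult(
    String documentId,
    String content,
    double score,
    Map<String, String> metadata
) {}

final class TopKSelector {

    // Keep results whose metadata has key=value, sort by descending score,
    // and return at most k of them.
    static List<RetrievalResult> topK(List<RetrievalResult> results,
                                      String key, String value, int k) {
        return results.stream()
            .filter(r -> value.equals(r.metadata().get(key)))
            .sorted(Comparator.comparingDouble(RetrievalResult::score).reversed())
            .limit(k)
            .toList();
    }
}
```

For example, `TopKSelector.topK(results, "lang", "en", 5)` would return up to five English-tagged chunks, best first, assuming the ingestion step wrote a `lang` entry into the metadata.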

What to measure

  1. Context recall — did retrieval include the evidence required to answer?
  2. Context precision — how much retrieved material was actually relevant?
  3. Faithfulness — is the answer supported by the supplied context?
  4. End-to-end latency — p50 is pleasant; p95 is reality.
  5. Abstention quality — does the system know when it does not know?
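
The first, second and fourth metrics can be computed directly once an evaluation set labels which chunks count as required evidence. The sketch below assumes chunk IDs are unique strings, defines recall and precision over those IDs, and uses the nearest-rank method for percentiles. All names are hypothetical. Faithfulness and abstention quality need a judgment about the generated answer and are not covered here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;

final class RetrievalMetrics {

    // Context recall: fraction of required evidence chunks that were retrieved.
    // Convention of this sketch: if nothing is required, recall is 1.0.
    static double contextRecall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 1.0;
        }
        long hits = relevant.stream().filter(retrieved::contains).count();
        return (double) hits / relevant.size();
    }

    // Context precision: fraction of retrieved chunks that are relevant.
    // Assumes the retrieved list has no duplicate IDs.
    static double contextPrecision(List<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0;
        }
        long hits = retrieved.stream().filter(relevant::contains).count();
        return (double) hits / retrieved.size();
    }

    // Nearest-rank percentile of latencies in milliseconds, with p in (0, 100].
    static long percentile(long[] latenciesMs, double p) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no latency samples");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With ten made-up samples such as 100, 120, 130, 150, 160, 180, 200, 250, 400 and 1900 ms, the median is 160 ms while the nearest-rank p95 is 1900 ms, which is the gap the latency item in the list points at.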

Reliability begins when “I don’t have enough evidence” is treated as a valid product outcome.
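
One simple way to make abstention an explicit outcome is a gate in front of generation. The sketch below assumes a score threshold and a minimum number of supporting chunks. Both values, the `ScoredChunk` record and the `AbstentionGate` class are hypothetical, and the thresholds would need tuning against the evaluation set.

```java
import java.util.List;
import java.util.function.Function;

// Minimal stand-in for a retrieved chunk; only the fields the gate needs.
record ScoredChunk(String documentId, double score) {}

final class AbstentionGate {

    static final String ABSTAIN_MESSAGE = "I don't have enough evidence to answer that.";

    // Calls the generator only when at least minSupporting chunks score
    // at or above minScore; otherwise returns the abstention message.
    static String answerOrAbstain(List<ScoredChunk> chunks,
                                  double minScore,
                                  int minSupporting,
                                  Function<List<ScoredChunk>, String> generator) {
        List<ScoredChunk> supporting = chunks.stream()
            .filter(c -> c.score() >= minScore)
            .toList();
        if (supporting.size() < minSupporting) {
            return ABSTAIN_MESSAGE;
        }
        // Only chunks that passed the threshold reach the generator.
        return generator.apply(supporting);
    }
}
```

Keeping the gate separate from the generator lets abstentions be logged and counted like any other outcome, so they can feed the evaluation loop.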