RAG evaluations that survive production
4 min read · by Andre Ross


How we build evaluation pipelines that catch regressions before customers do.

llm · rag · evaluation

Most teams write RAG evaluations once, hit 87% in a notebook, and ship. Three months later the model provider deprecates the embedding endpoint, the corpus drifts, and the answers silently get worse.

The fix is boring but powerful: treat evaluation as a *product surface*, not a one-off script. We write golden sets per use-case, score them on every prompt change *and* every weekly cron, and gate deploys on a numerical threshold the same way unit tests gate code.
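Here's a minimal sketch of what that deploy gate can look like. The golden-set JSON format, the `score_answer` heuristic, the `rag_pipeline` module, and the 0.85 threshold are all illustrative assumptions, not our production pipeline; the point is the shape of the check, not the scorer.

```python
# A minimal deploy gate: score a golden set, fail CI below a threshold.
import json
import sys

THRESHOLD = 0.85  # deploys are blocked below this score (assumed value)

def score_answer(answer: str, expected_spans: list[str]) -> float:
    """Toy grounding score: fraction of expected source spans the answer cites."""
    if not expected_spans:
        return 1.0
    hits = sum(1 for span in expected_spans if span in answer)
    return hits / len(expected_spans)

def run_golden_set(path: str, generate) -> float:
    """Run every golden case through the RAG pipeline and average the scores."""
    with open(path) as f:
        cases = json.load(f)  # [{"question": ..., "expected_spans": [...]}, ...]
    scores = [score_answer(generate(c["question"]), c["expected_spans"])
              for c in cases]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # `rag_pipeline.answer` is a hypothetical stand-in for whatever
    # function turns a question into an answer in your stack.
    from rag_pipeline import answer
    score = run_golden_set("golden/billing.json", answer)
    print(f"golden-set score: {score:.3f} (threshold {THRESHOLD})")
    sys.exit(0 if score >= THRESHOLD else 1)  # nonzero exit fails the CI job
```

The same script can run on the weekly cron; the only difference is that a prompt change blocks a deploy while a cron failure pages an owner.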

What goes into the eval matters. We score for grounding (did the answer cite real source spans?), for refusal calibration (did it correctly say "I don't know"?), and for stylistic conformance to the brand voice. Each gets its own weight. Each gets its own owner.
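One way to wire those dimensions together is a weighted composite, sketched below. The dimension scorers, weights, owners, and case fields here are illustrative assumptions; a real grounding check would compare against retrieved source spans rather than substrings, and a voice score typically comes from a judge model.

```python
# Sketch of a weighted composite score across the three dimensions above.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalDimension:
    name: str
    owner: str    # who gets paged when this dimension regresses
    weight: float
    score: Callable[[str, dict], float]  # (answer, case) -> score in [0, 1]

DIMENSIONS = [
    # Toy scorers for illustration only.
    EvalDimension("grounding", "retrieval-team", 0.5,
                  lambda ans, case: float(any(s in ans for s in case["spans"]))),
    EvalDimension("refusal", "safety-team", 0.3,
                  lambda ans, case: float(("I don't know" in ans) == case["should_refuse"])),
    EvalDimension("brand_voice", "content-team", 0.2,
                  lambda ans, case: case.get("voice_score", 1.0)),  # e.g. from a judge model
]

def composite(answer: str, case: dict) -> float:
    """Weighted average across dimensions; weights are assumed to sum to 1."""
    return sum(d.weight * d.score(answer, case) for d in DIMENSIONS)
```

However the individual scorers are implemented, each dimension reduces to a number with a named owner attached, which is what makes regressions attributable.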
