
NOVA: A Guide to Actually Measuring How Your Agent Works on Your Data

Numbers Over Vibes: Building a RAG Evaluation Framework That Actually Works

March 11, 2026


Building Evaluation From the Ground Up

The most natural place to start evaluating a RAG system is the answer. Does the system respond correctly? It’s tempting to start and stop here, but it's not enough. Here's how we think about layering evaluation, starting from the obvious and expanding outward.

Every NOVA run tracked in Weave: model candidate, metrics per stage, and full experiment lineage in one place.

Layer 1: Retrieval and generation

At its simplest, RAG is "retrieve relevant chunks, generate an answer." So the first layer of evaluation is straightforward: did you retrieve the right documents, and is the answer any good?

For retrieval, this means standard IR metrics: precision, recall, NDCG, measured at various cutoffs. These tell you whether the system is surfacing the right content before a single token is generated.
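As a rough sketch of what these metrics compute (function names and the graded-relevance convention here are illustrative, not NOVA's actual code):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved docs that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def ndcg_at_k(retrieved, relevance, k):
    """NDCG@k with graded relevance (relevance maps doc id -> gain)."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(retrieved[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

Measuring at several cutoffs (k = 1, 5, 10, ...) matters because the generation model only ever sees the top few chunks: recall@100 can look healthy while precision@5 is the number that actually predicts answer quality.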

For generation, the question is harder. Evaluating answer quality is a problem NLP has wrestled with for decades: BLEU, ROUGE, BERTScore, and their variants all tried to quantify text quality by comparing against reference answers, but they fall apart when many different phrasings can all be correct. 

LLM-as-judge approaches have made this more tractable, but bring their own pitfalls: they're biased toward verbose, confident-sounding answers, they can be inconsistent across runs, and they need careful calibration. Naively asking an LLM to "rate this answer 1-10" is a vibe check with extra steps.

What works is constrained, structured evaluation: judges that score specific dimensions on defined rubrics, carefully calibrated so they're consistent across runs. A dedicated refusal judge to ensure the model declines when it lacks the necessary sources. A faithfulness judge to catch answers that drift from the retrieved content. These sit alongside deterministic checks that don't need an LLM at all: keyword coverage, language consistency between query and response. Our key insight is that one mega-judge trying to assess everything at once performs worse than a panel of focused evaluators.

Layer 2: Reranking – is it worth the latency?

Modern RAG systems usually include a reranker between retrieval and generation: it re-scores retrieved chunks with a more powerful (and slower) model, pushing the most relevant results to the top. It can also help separate genuinely relevant documents from noisy ones.

In theory, this improves precision. In practice, you need to verify it's not just adding latency. We measure retrieval quality before and after reranking with the same metrics. Without that comparison, you can't tell if the reranker is actually helping, doing nothing, or making things worse on certain query types.
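The before/after comparison can be sketched in a few lines; `retrieve`, `rerank`, and the judgment format below are illustrative stand-ins for whatever your pipeline exposes:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked docs that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def reranker_lift(queries, retrieve, rerank, judgments, k=5):
    """Measure precision@k before vs. after reranking, per query.

    retrieve(q) -> ranked doc ids; rerank(q, docs) -> re-ordered doc ids;
    judgments[q] -> set of relevant doc ids. A negative delta means the
    reranker is hurting that query.
    """
    deltas = {}
    for q in queries:
        base = retrieve(q)
        before = precision_at_k(base, judgments[q], k)
        after = precision_at_k(rerank(q, base), judgments[q], k)
        deltas[q] = after - before
    return deltas
```

Keeping the deltas per query (rather than one averaged number) is what surfaces the "worse on certain query types" failure mode: a reranker can show a positive mean lift while consistently demoting the right chunk for, say, keyword-style queries.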

Layer 3: Parsing and chunking – what happens before retrieval even starts

Documents aren't born as clean text chunks. They go through parsing (OCR, PDF extraction, table detection) and chunking (splitting into retrievable units).

While this layer sits outside the typical "RAG" mental model, it is crucial to evaluate: garbage in, garbage out. In practice, parsing failures are some of the most common and most invisible sources of degradation. We've lost count of how many times a "hallucinating model" turned out to be a parsing problem: the answer wasn't wrong, the model just never had the right content to work with. We evaluate parsing coverage and chunk quality as first-class metrics, not afterthoughts.
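Even crude deterministic checks at this layer catch a surprising amount. A sketch of the kind of parsing-coverage and chunk-sanity metrics meant here (thresholds and names are illustrative):

```python
def parsing_coverage(source_char_count, parsed_text):
    """Rough coverage: characters recovered by the parser vs. characters
    expected from the source document. A sudden drop flags silent
    extraction failures (missed pages, dropped tables)."""
    if not source_char_count:
        return 0.0
    return min(len(parsed_text) / source_char_count, 1.0)

def chunk_quality(chunk, min_len=50, max_len=2000):
    """Deterministic chunk sanity checks: length bounds and the ratio of
    printable characters (low ratios often mean OCR or encoding garbage)."""
    printable = sum(c.isprintable() or c.isspace() for c in chunk)
    return {
        "length_ok": min_len <= len(chunk) <= max_len,
        "printable_ratio": printable / max(len(chunk), 1),
    }
```

Tracking these per document type (scanned PDFs vs. native PDFs vs. HTML) is what turns "the model hallucinates on contracts" into "the table extractor drops 30% of scanned contracts".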

Layer 4: Routing and agentic decisions – the layer on top

In a simple RAG system, every query goes through the same pipeline. But in an agentic system, an intelligent agent decides, before a single document is fetched: should I retrieve at all? Which tool should I use? How should I reformulate the query? Should I search one collection or three?

As we discussed in RAG is Dead, Long Live RAG, this decision stack is where modern RAG lives. And it's another layer that needs its own evaluation: routing accuracy, query writing quality, tool selection precision. A routing error invalidates everything downstream.
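Routing accuracy reduces to a classification metric once you have labeled queries; the router interface below is a hypothetical simplification (real agents may emit multiple decisions per query):

```python
def routing_accuracy(router, labeled_queries):
    """Fraction of queries routed to the expected tool or collection.

    router(query) -> decision label; labeled_queries is a list of
    (query, expected_label) pairs drawn from real traffic.
    """
    correct = sum(1 for q, expected in labeled_queries if router(q) == expected)
    return correct / len(labeled_queries)
```

Because a routing error invalidates everything downstream, it's worth slicing this metric by decision type: a system with 95% overall accuracy that always mis-routes "compare these two documents" queries still fails that entire workflow.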

The full picture

Each layer adds evaluation surface. By the time you've built all four, you're measuring your system from document ingestion to final answer, at every decision point, with quality and latency tracked together across stages.

The principle: you can only improve what you measure, and you can only debug what you've instrumented.

Academic Benchmarks: Hypothesis, Not Conclusion

Academic benchmarks are essential for comparing model capabilities in controlled conditions. This is why we're proud of having LightOnOCR leading OlmOCR-Bench, ColBERT-Zero and Reason-ModernColBERT ranking among the top retrieval models on BEIR and BRIGHT, and OriOn pushing the state of the art on long-context benchmarks. We care about their quality too: we released MMLBD-C, a corrected version of MMLongBench-Doc, and contribute upstream to MTEB, from adding multi-vector support to fixing scoring bugs and dataset issues in BRIGHT and LoTTE.

But benchmarks keep you honest about models, not about systems. They can't tell you whether a model works inside your pipeline, on your documents, with your query distribution. The public leaderboard is the hypothesis; the eval pipeline is the experiment. This is why NOVA exists: when we evaluate a new model, it goes through public benchmarks and our internal suite, across multiple datasets, languages, and document types. If a model tops the leaderboard but regresses on table-heavy documents in our pipeline, we know before it ships.

Evaluation Is a Living Practice

It's tempting to treat evaluation as a gate you pass once before shipping. But your system isn't static: new data sources introduce document types your chunking strategy wasn't designed for, LLMs don't like the same prompts forever, user behavior evolves, and agentic systems add new decision points with each iteration.

Software engineering learned this decades ago: the earlier you catch a bug, the cheaper it is to fix. The same applies to ML pipelines. When a new model degrades retrieval on a specific query type, you want to know before it ships. When a new model, more verbose by default, introduces a latency regression, you want to catch it in CI, not in production monitoring.

This is the shift-left principle applied to RAG: move testing earlier in the development cycle, run your evaluation suite on every significant change, and make it part of the development loop rather than an afterthought. Here's what that looks like in practice.

Evaluation datasets that match your users. We evaluate across multiple domains, languages, document types, and difficulty levels, from simple factoid retrieval to multi-hop reasoning over mixed text-and-visual documents. Handcrafted datasets from real usage patterns alongside adapted academic benchmarks.

Every change goes through the eval suite. New model candidate? New chunking strategy? New prompt template? It runs through NOVA before it goes anywhere near production. We love using Weave, as it gives us full experiment lineage: what changed, what moved, and whether we should ship it. Nightly monitoring tracks drift across model configurations, so gradual regressions don't slip through.

Quality and performance tracked together. Every eval run measures per-stage latency alongside quality scores. A query expansion method that improves precision but doubles retrieval time is a tradeoff you want to see before shipping, not after.

Evaluation doesn't just measure, it drives optimization. Beyond pass/fail gating, we use NOVA as the fitness function for automated optimization, using frameworks such as GEPA, then validate the optimized configuration with a full end-to-end eval run before syncing to production. The eval pipeline isn't just a quality gate; it's the mechanism that makes the system better.
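The per-stage timing and ship/no-ship gating described above can be sketched as follows; the metric names and thresholds are illustrative, not NOVA's actual configuration:

```python
import time

def timed_stage(fn, *args, **kwargs):
    """Run one pipeline stage, returning (result, latency in seconds),
    so quality and latency land in the same eval record."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

def should_ship(candidate, baseline, max_quality_drop=0.01, max_latency_ratio=1.2):
    """CI gate for a candidate config against the current baseline:
    quality must not drop beyond tolerance, latency must not balloon."""
    quality_ok = candidate["quality"] >= baseline["quality"] - max_quality_drop
    latency_ok = candidate["latency_s"] <= baseline["latency_s"] * max_latency_ratio
    return quality_ok and latency_ok
```

The same scalar gate doubles as a fitness signal for automated optimization: a candidate that trades a small precision gain for a 2x latency hit is rejected by the same rule in CI and in the optimization loop.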

The Payoff

Without systematic evaluation, you can't tell the difference between a real improvement and a regression you haven't noticed yet. You find out in production, weeks later, from a customer.

With NOVA, those surprises become things we catch in CI instead.

Building this evaluation machinery is an investment. It takes time, expertise, and a willingness to slow down before you speed up. But it compounds: less time arguing about architecture, less time debugging production issues blind, and faster iteration cycles because you know exactly where to focus.


At LightOn, we build Paradigm, a secure enterprise RAG platform. NOVA is our north star: it's how we ensure Paradigm delivers on its promises, and how we help our customers build the same confidence in their own systems.

