
EDiTh: Enterprise Search Benchmark for Questions You Can't Outsource

LightOn is releasing EDiTh, an open benchmark where executives can finally see their own questions tested on documents that feel real, without exposing anything confidential.

March 3, 2026

TL;DR

The hardest questions inside a company are confidential. Which inherited contracts expose us to sanctions? Which specification was in effect when that part failed? What non-competes actually bind us across six jurisdictions? You can't hand them to a vendor. You can't put them in an RFP. So you evaluate enterprise search tools on clean demos that test none of the things that will actually break, and you buy on faith.

EDiTh is the benchmark built to answer one question most RAG teams can't answer today: is your system actually ready for enterprise documents? Over 1,000 real documents, 33 use cases, several languages, and full evaluation infrastructure. No synthetic filler. No pre-cleaned PDFs. Just the chaotic documents you live with every day.

What happens when the real documents show up

Every enterprise search deployment follows the same arc. The demo works. The POC goes well on curated data. Then someone says: "Great, now let's try it on our actual documents."

That's when things fall apart, and they fall apart in ways that are hard to diagnose because nobody built evaluation infrastructure for what enterprise documents actually look like. The contracts are in three languages. The scanned files are rotated and degraded. The org charts that explain who owns what subsidiary are implicit, buried in boilerplate, or stored in a system that was decommissioned in 2019.

The vendor says "we need to tune the system." The client hears "it doesn't work." Both are partially right, but neither has the tooling to understand exactly where, why, and how badly it fails.

The fundamental issue is that there has never been a realistic, open test environment that reproduces this complexity because the complexity is always locked behind NDAs.

EDiTh: a corporate universe you can actually test against

EDiTh is a benchmark built around a documentary digital twin: a fictional but rigorously grounded corporate universe, designed so that, for the first time, an executive can see her own questions asked and answered on documents that feel real.

At the center is Véracier Industries S.A., a €1.8B French industrial group operating across aerospace, defense, nuclear energy, and rail. Seven subsidiaries. Five countries. Nine thousand employees. And then, mid-scenario, an acquisition: Précis-Tec, bringing 2,800 inherited contracts with different languages, different formats, different regulatory regimes, and a few buried risks that only surface if the system is genuinely reasoning across documents rather than pattern-matching keywords.

The questions EDiTh asks aren't academic. They're the questions that keep general counsel up at night, that a CISO needs answered by Friday, that a CFO asks on Day 1 after an acquisition closes. The difference is that here, the ground truth is known. The answers exist. And when a system gets it wrong, you can trace exactly why.

The questions that break systems (and why executives recognize them instantly)

What makes EDiTh powerful isn't the volume. It's that the scenarios are the ones executives have actually lived through, but could never use to evaluate a tool.

"Which inherited contracts expose us to sanctions risk?" Day 1 after closing an acquisition. Documents span French, English, German. A Gazprom subsidiary isn't flagged by geography, it's identified through entity chains buried in contract boilerplate. An Algerian agent contract looks like a sanctions risk but is actually an anti-corruption risk. The keywords overlap. The legal consequences don't. Every general counsel who's been through an acquisition knows this question. No benchmark has ever tested it.

"Which of our supplier contracts actually cover this force majeure?" A supply chain crisis hits. Force majeure means different things in different places: force majeure in France, Höhere Gewalt in Germany, Excusable Delays in some English contracts. And some clauses explicitly exclude sourcing disruptions even though they cover "acts of God." A procurement director reading this just nodded. She's seen this exact failure mode, and watched a tool confidently return the wrong answer.

"What specification was in effect when that part was manufactured?" A field failure. The document trail runs through ECN registers, migration logs, and scanned production files from systems now decommissioned. Pages are rotated. Some are degraded. One critical page is missing entirely, and the system needs to know it's missing, not silently work around it. In aerospace and nuclear, getting this wrong has regulatory consequences. For the vendor, it means losing the deal before it starts.

"What non-compete obligations actually exist across our entities?" Six jurisdictions, one question. A California contract contains a non-compete that's void by law, but the contract doesn't say so. A German template is enforceable only if paired with financial compensation. A 40-page scanned batch needs to be decomposed into individual contracts before anything can be reasoned about. The head of M&A has asked this question in every transaction. She's never been able to test whether a tool can answer it.

These aren't edge cases. They're Tuesday morning.

What this actually measures (and why it matters for buying decisions)

EDiTh doesn't produce a single score. It exposes the specific failure modes that determine whether a system will survive contact with real enterprise documents.

Terminology variance. Same legal concept, different names across jurisdictions. If your retrieval pipeline can't handle semantic equivalence across languages, it will systematically miss 20–30% of relevant documents in multilingual corpora. You won't know until a client's lawyer finds the gap.
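
One way to see what passing this test takes: before retrieval, a system has to treat jurisdiction-specific clause names as the same legal concept. A minimal sketch of that idea in Python, using a hand-built alias table for query expansion (the table, concepts, and function name are illustrative, not part of EDiTh):

```python
# Sketch: expanding a query with jurisdiction-specific aliases so that
# semantically equivalent clause names are retrieved together.
# The alias table below is illustrative, not EDiTh's actual terminology map.
CONCEPT_ALIASES = {
    "force majeure": ["force majeure", "höhere gewalt", "excusable delays"],
    "non-compete": ["non-compete", "wettbewerbsverbot",
                    "clause de non-concurrence"],
}

def expand_query(query: str) -> list[str]:
    """Return all known aliases for concepts mentioned in the query."""
    q = query.lower()
    terms: set[str] = set()
    for concept, aliases in CONCEPT_ALIASES.items():
        if concept in q:
            terms.update(aliases)
    return sorted(terms) or [q]  # fall back to the raw query
```

In practice a production system would use multilingual embeddings rather than a static table, but the failure mode being measured is the same: a query in one jurisdiction's vocabulary must still reach documents drafted in another's.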

OCR resilience. Scanned documents with rotation, noise, and degradation aren't exceptions in enterprise document stores; they're the norm. More importantly: can the system detect when a document is too degraded to trust, rather than silently returning wrong information? The difference matters enormously when the output feeds a regulatory filing.

Cross-entity reasoning. Identifying that Company A owns Company B, which has a contract with Company C, requires understanding organizational structure, not just name-matching. This is where most retrieval systems quietly fail in M&A and compliance workflows, and where the cost of failure is highest.
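
A minimal sketch of the underlying mechanism, assuming ownership relations have already been extracted into a graph: walk the parent chain so a counterparty is flagged even when its own name never appears on a sanctions list. The graph, subsidiary names, and helper functions are illustrative:

```python
# Sketch: following ownership edges to find a counterparty's ultimate
# parents, so a sanctioned owner is flagged even when the contracting
# entity's own name never matches a sanctions list.
# The ownership graph and subsidiary names below are illustrative.
OWNS = {  # parent -> subsidiaries
    "Gazprom": ["GP Trading GmbH"],
    "GP Trading GmbH": ["NordPipe Services SARL"],
}

def ultimate_parents(entity: str) -> set[str]:
    """All direct and indirect parents of an entity."""
    parents = {p for p, subs in OWNS.items() if entity in subs}
    out = set(parents)
    for p in parents:
        out |= ultimate_parents(p)
    return out

def sanctions_exposed(counterparty: str, sanctioned: set[str]) -> bool:
    """True if the counterparty or any owner is on the sanctions list."""
    return bool(({counterparty} | ultimate_parents(counterparty)) & sanctioned)
```

The hard part EDiTh measures isn't this traversal; it's building the graph in the first place, when the ownership edges are implicit in contract boilerplate rather than stated anywhere as structured data.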

Temporal reasoning. Connecting documents across time: knowing that a specification was revised in 2019, superseded in 2020, and that what matters is the version in effect on a specific date. In regulated industries, this is the difference between a useful system and a dangerous one.

Why this changes the buying conversation

If you're evaluating enterprise search and reasoning tools today, you face a structural problem: you can't test what matters most because what matters most is confidential.

EDiTh changes that equation. It gives you three things that have been genuinely impossible to get until now.

First, a way to see your own questions reflected in a benchmark. Not "can the system find a needle in a haystack," but "can it find the right needle, in the right haystack, when there are fourteen haystacks across seven countries and the needle looks completely different in German?" If you're an executive who has lived through a post-acquisition integration, a cross-border compliance review, or a product liability investigation, you'll recognize these scenarios, and you'll finally have a way to test whether a tool can handle them.

Second, precise failure localization. Not "the system struggles with complex documents" but "your pipeline fails on cross-entity reasoning in multilingual corpora, specifically when entity relationships are implicit rather than named." That's actionable. "It's not good enough" isn't.

Third, an honest answer to the question every vendor will eventually face: "How do you evaluate your system?" Having a benchmark with documented methodology, known ground truth, and real document complexity is a fundamentally different conversation than "we tested it on some representative documents."

What's in the dataset

Over 1,000 documents (78.7 MB) across contracts, reports, policies, and certifications, in three PDF formats including scanned batches with realistic artifacts. 33 evaluation use cases covering post-acquisition triage, compliance mapping, temporal reasoning, cross-entity detection, and regulatory analysis. Multiple languages with authentic legal drafting, not translations. Full evaluation infrastructure: ground-truth answer keys, a metadata index, per-use-case documentation, and automated verification.

The dataset is organized three ways: by document type, by stakeholder role (legal, CISO, HR, finance, operations), and by use case, so you can evaluate the scenarios most relevant to your context, whether you're building a product or buying one.
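
As a sketch of how such answer keys might be consumed, here is a per-use-case precision/recall scorer. The key format, sets of relevant document ids per use case, is a hypothetical stand-in, not EDiTh's actual schema:

```python
# Sketch: scoring a system's retrieved documents against a ground-truth
# answer key for one use case. The "set of relevant document ids" format
# is a hypothetical stand-in, not EDiTh's actual answer-key schema.
def score_use_case(predicted: set[str], relevant: set[str]) -> dict[str, float]:
    """Precision and recall of predicted document ids vs. ground truth."""
    tp = len(predicted & relevant)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return {"precision": precision, "recall": recall}
```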

Why we built this

At LightOn, we build enterprise search and reasoning infrastructure deployed on-premise for aerospace, defense, finance, energy, and the public sector. We've spent years watching the same pattern play out: a tool that looks great in the demo fails on real documents, and nobody has the evaluation infrastructure to understand why, or to prevent it next time.

EDiTh came from that frustration. Not from a research exercise, but from real deployments, real failure modes, and a conviction that the field needs better mirrors.

We're releasing it as an open benchmark because the problem it solves isn't ours alone. Every enterprise buyer deserves to test tools against questions that actually look like theirs. Every vendor deserves to know where their system breaks before a client discovers it. Think of it as an Enterprise Digital Twin or a Virtual Company. 

The dataset, evaluation code, and baseline results will soon be available on GitHub. Follow us to be notified when it drops.

Because the question your client is asking on Monday morning isn't "does this work on clean PDFs?" It's "can I trust this with the questions I can't show anyone else?"

EDiTh lets you answer that, before they ask.


EDiTh is a project led by Adèle Guignochau and Igor Carron.
