TL;DR
At small volumes, OCR quality is the only thing that matters. At scale, cost becomes just as structural a constraint. A model that reads perfectly but prices you out of production isn't a solution; it's a pilot that never ships.
We built LightOnOCR-2-1B for production.
Measured by someone else, on documents we didn't choose
A developer recently published an open source workbench to compare OCR engines on real-world documents. Four publicly available PDFs: a corporate document, a handwriting sample, a multi-column annual report, a German medical bulletin. Full methodology on GitHub.
Nine engines tested. One scored Excellent on all four documents. Handwriting is where every open source engine fails and where Azure Document Intelligence only reaches Good; LightOnOCR-2-1B is the only engine that reads it correctly. And at 0.5 cents per page, it costs half of what Azure does, with better results on three documents and equal results on the fourth.
Small model. Serious results.
LightOnOCR-2-1B runs on standard GPU hardware. It deploys inside your own infrastructure. Your documents stay within your perimeter.
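To make that concrete, here is a minimal sketch of what a self-hosted pipeline could look like: the model served on your own GPU with vLLM's OpenAI-compatible server and queried one page at a time. The model ID, port, and file name are illustrative assumptions, not a prescribed setup.

```python
# A minimal self-hosted OCR loop, assuming the model is served locally with
# vLLM's OpenAI-compatible server, e.g.:
#   vllm serve lightonai/LightOnOCR-2-1B   # model ID is an assumption
import base64

from openai import OpenAI

# Point the standard OpenAI client at your own server: no document
# ever leaves your infrastructure.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Encode one page image as a data URL (hypothetical file name).
with open("page_001.png", "rb") as f:
    page_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="lightonai/LightOnOCR-2-1B",  # assumed Hugging Face model ID
    messages=[{
        "role": "user",
        "content": [{
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{page_b64}"},
        }],
    }],
)

print(response.choices[0].message.content)  # the page, as extracted text
```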
At one million pages per month, the economics are straightforward. At any volume, the architecture is the same: no external dependency, no data exposure, no per-page cloud bill that scales against you.
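A back-of-the-envelope calculation shows how straightforward. The 0.5 cents per page is the benchmark's figure; the Azure rate below is inferred from "half the cost of Azure" above, so treat it as an approximation rather than a published price.

```python
# Monthly OCR bill at one million pages, using the rates from the benchmark.
PAGES_PER_MONTH = 1_000_000
LIGHTONOCR_USD_PER_PAGE = 0.005  # 0.5 cents, the benchmark's figure
AZURE_USD_PER_PAGE = 0.010       # assumption: twice the LightOnOCR rate

print(f"LightOnOCR-2-1B: ${PAGES_PER_MONTH * LIGHTONOCR_USD_PER_PAGE:,.0f}/month")
print(f"Azure DI:        ${PAGES_PER_MONTH * AZURE_USD_PER_PAGE:,.0f}/month")
# LightOnOCR-2-1B: $5,000/month
# Azure DI:        $10,000/month
```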
This is what production-ready looks like.
Credit where it's due
The benchmark referenced in this post was built and published independently by Jonas Wacker. His OCR Workbench is open source, reproducible, and available for anyone to run on their own documents. We had no part in it, and that's exactly what makes it valuable.
LightOnOCR-2-1B is available on HuggingFace. For enterprise document pipelines with full data sovereignty, explore Paradigm.