
Introducing OriOn: the SOTA Long-Context Engine That Powers Agentic Search & Reason

Agentic AI starts with retrieval. It scales with long context.

February 18, 2026

TL;DR

LightOn releases OriOn, a family of long-context visual language models that process up to 250 pages in a single pass. The 32B model achieves SOTA, matching or beating models 7× its size. Combined with Paradigm's Visual RAG, OriOn gives agentic workflows the extended memory they need to reason deeply, chain tools, and execute complex tasks on sovereign infrastructure. Training recipes, benchmarks, and ablation insights are all open.

Enterprise AI is entering a new era: autonomous agents that orchestrate complex workflows, call multiple tools, chain MCP actions, and make decisions across dozens of sources. But every agentic system faces the same bottleneck: the agent is only as good as the knowledge it can access and reason over.

This is where RAG becomes the foundation of agentic AI. Paradigm's Visual RAG stack gives agents the ability to search millions of documents on sovereign infrastructure and surface exactly the right context, in real time, at every step of a workflow. Retrieval is the agent's first reflex. It's what keeps it grounded, accurate, and connected to your actual enterprise knowledge.

But the most ambitious agentic workflows demand more than retrieval. An agent auditing a regulatory filing needs to hold hundreds of pages in memory while executing a compliance check. An agent orchestrating a due diligence process needs to chain tool calls without losing track of what it read three steps ago. An agent driving a multi-source analysis needs to reason across everything it has gathered, not just the last retrieved chunk.

Today, we're releasing OriOn, a family of long-context visual language models that give Paradigm's RAG-powered agents the extended memory they need.

OriOn processes up to 250 pages at full visual resolution in a single pass. Our 32B-parameter model matches or exceeds models 7× its size on the most challenging long-document benchmarks. State-of-the-art performance, compact enough to deploy on-premise. No 200B+ compute footprint. No dependency on external providers.

The combination is powerful: RAG grounds the agent in the right knowledge. OriOn lets it reason deeply, plan next steps, and drive multi-tool execution across that knowledge. Thanks to prefix caching, each turn in an agentic loop is near-instant, making complex orchestration not just possible but production-ready.

We're sharing everything: training recipes, cleaned benchmarks, and insights from 50+ ablation experiments, all available in our HuggingFace collection. At LightOn, our edge isn't secrecy. It's the speed at which we turn breakthrough research into deployed, sovereign product.

Technical Deep Dive

The "Deep Reader" Companion to RAG

Retrieval-Augmented Generation (RAG) has become the gold standard for searching across massive knowledge bases. It excels at breadth, finding the needle in a haystack of thousands to millions of documents. However, RAG hits a natural limit when the task shifts from finding a document to understanding a complex one.

Most RAG pipelines retrieve a fixed window of context (e.g., top-5 pages or chunks). If a user's question requires synthesizing information across more than that window, for example, "Summarize the evolution of risk factors across this entire 300-page annual report", standard retrieval struggles. It either misses critical context or relies on complex, slow recursive summarization steps.

This is where Long-Context (LC) Visual Language Models (VLMs) come in. They're not a replacement for RAG; they complement it. Combined with visual retrieval, they allow visual-only pipelines that skip OCR entirely, processing rendered pages directly. By training models to ingest hundreds of pages (as well as long text sequences) at once, we unlock a new class of capabilities: holistic document analysis, multi-hop question answering and reasoning without retrieval gaps, and handling vague, high-level queries that lack targeted content.

In this post, we share some of our findings for training state-of-the-art long-context visual document models (extending Mistral from 128K to 344K context, equivalent to 250 pages at 1024×1024 resolution) using continued pretraining (CPT), supervised fine-tuning (SFT), and LongPO preference optimization. We show the effectiveness of these training methods and of various synthetic data strategies, including a novel LC answer generation pipeline that enables self-improvement via SFT, and how they enable a flexible trade-off between retrieval speed and deep reasoning power. Our findings result in reproducible recipes that produce strong visual LC models and are released for the community to explore. We encourage researchers and engineers to check out the leaderboard, which makes it easy to compare runs and their training recipes!

2 Why Long-Context? Expanding the Retrieval Horizon

Long-Context VLMs offer a distinct set of capabilities that complement a traditional RAG pipeline:

  • Expanding the "Retrieved" Set: RAG is typically set to pull in up to X pages. If a question spans more than X, the pipeline fails. LC allows us to drastically expand X (e.g., from 5 pages to 100+ pages). This gives developers a new lever: choose a higher X to increase compute intensity in return for better performance on complex queries. In other words, we can tune our retrieval pipeline to be recall oriented instead of precision oriented.
  • Multi-Hop Reasoning: Consider a question like: "Compare the risk factors from 2022 to 2024 and explain how the mitigation strategies evolved." A typical approach decomposes this into sub-questions (What were the 2022 risk factors? The 2024 ones? What mitigation strategies were mentioned?), retrieves chunks for each, and then synthesizes an answer. The problem: with a short context window, you retrieve X chunks for the first sub-question and you've already exhausted your budget. The later sub-questions get squeezed out. With LC, you can retrieve chunks for all sub-questions and pass everything to the model, letting it reason across the full evidence set.
  • Handling Vague & Holistic Questions: RAG relies on specific queries to find semantic matches. If a user asks, "What is the general tone of this document?" or "Are there any formatting inconsistencies?", a retriever has no specific hook. An LC model, however, can process the entire document end-to-end, allowing it to answer high-level questions that require reading the whole story.
  • Efficiency via "One-Pass" Analysis: In a typical document analysis scenario, you call a model with a short context window on chunks, then aggregate answers. This has two downsides:
    1. Information loss: Each call must output what it thinks is necessary to keep, risking error propagation to later steps.
    2. Decoding overhead: Inference has two phases—prefill (fast) and decode (slow). Chunked approaches decode far more tokens.

Consider a document that spans 10 context windows (10C tokens). With chunking:

Chunked: 11C (prefill) + 11T (decode) — 10 chunks + 1 aggregation call, each outputting T tokens

With a LC model, the entire document fits in one pass:

Long-Context: 10C (prefill) + T (decode)

We save massively on decoding—the most expensive part of inference. Additionally, thanks to prefix caching, subsequent turns skip the prefill phase entirely, making multi-turn conversations faster.
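For a rough sense of the savings, here is a minimal sketch of the cost accounting above. The decode-to-prefill cost ratio and the token counts are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope cost model for the chunked vs. long-context comparison above.
# C = tokens per context window, T = tokens decoded per call, n = number of chunks.
# decode_weight is an illustrative assumption (decoding a token costs more than prefilling one).

def chunked_cost(n, C, T, decode_weight=10.0):
    """n chunk calls plus one aggregation call; the aggregation input is counted
    as one extra context window, matching the 11C + 11T accounting above."""
    prefill = (n + 1) * C
    decode = (n + 1) * T
    return prefill + decode_weight * decode

def long_context_cost(n, C, T, decode_weight=10.0):
    """Single pass over the whole document, single decoded answer."""
    return n * C + decode_weight * T

if __name__ == "__main__":
    n, C, T = 10, 32_000, 1_000  # illustrative sizes
    print("chunked:     ", chunked_cost(n, C, T))
    print("long-context:", long_context_cost(n, C, T))
```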

3 Experimental Setup and Models

Before detailing our recipes, it is crucial to understand the evaluation landscape. We test our models on a suite of long-context benchmarks, with a primary focus on MMLongBenchDoc, or rather our cleaned version MMLBD-C, currently the most challenging benchmark for LC visual document models. To illustrate the difficulty: an expert human achieves roughly 65.8% accuracy, while GPT-4o achieves only 46.3%. The recent open-weight Qwen3 VL 235B achieved a massive leap in performance to ~57%, yet significant headroom remains.

To systematically explore what works for long-document understanding, we experiment with two distinct model classes:

  • Mistral Small 3.1 (24B): A "weaker" model relative to the current SOTA. We extend its maximum context length to 344K tokens using Continued Pretraining (CPT), training on documents up to 336 pages. Training Mistral allows us to test the impact of our data pipelines when we have access to a strong teacher model.
  • Qwen3 VL (32B): A strong base model which we train to push state-of-the-art performance. While our best performing recipes involve distilling knowledge from the powerful Qwen3 VL 235B, our experiments with Mistral show that our methods work even without a strong teacher. Specifically, our recursive answer generation pipeline enables self-improvement, boosting performance even when the model distills from itself.

3.1 Evaluation Setup

Conducting large sets of ablations requires diverse benchmarks to reduce noise and avoid overfitting to any single test. We employ a suite of long-context benchmarks targeting both visual and textual LC performance, with a focus on long-document understanding.

Benchmarks:

  • Visual LC: MMLongBenchDoc and our corrected version MMLBD-C (detailed below), MMLongBench (at 128K context), DUDE, and SlideVQA
  • Text LC: HELMET and LongBench v2

Aggregate Metrics: Since these benchmarks have different score distributions, we normalize scores by the maximum achieved (typically Qwen3 VL 235B) before averaging:

  • Visual-LC Average (VA): Our primary metric, averaged across all visual LC benchmarks
  • LC Average (LCA): Includes both visual and text LC benchmarks

Since we focus on long-document VQA, Visual-LC Average is our primary metric, with MMLBD-C as the tiebreaker since it is the most challenging and relevant benchmark.
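As a small illustration of this aggregation, here is a sketch of the normalize-then-average computation. The benchmark names and values below are placeholders, not the reported numbers.

```python
# Sketch of the Visual-LC Average (VA) aggregation: normalize each benchmark score
# by the best score achieved on it (typically Qwen3 VL 235B), then average.
# Benchmark names and values are placeholders for illustration only.

def normalized_average(scores: dict[str, float], reference: dict[str, float]) -> float:
    return 100.0 * sum(scores[b] / reference[b] for b in scores) / len(scores)

model_scores = {"MMLBD-C": 45.0, "MMLongBench-128K": 70.0, "DUDE": 55.0, "SlideVQA": 60.0}
best_scores  = {"MMLBD-C": 56.0, "MMLongBench-128K": 78.0, "DUDE": 60.0, "SlideVQA": 65.0}

print(f"Visual-LC Average: {normalized_average(model_scores, best_scores):.1f}")
```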

4 Cleaning the Benchmark: MMLBD-C

While reviewing MMLongBenchDoc, we found numerous errors: questions paired with the wrong document, incorrect answers, typos, and underspecified questions. To advance the field of long-document understanding, we manually corrected 251 examples and removed 16, creating MMLBD-C—a more reliable evaluation standard that we release publicly.

5 Recipes for Success: What Works in Practice

We train our models on a combined corpus of web-scraped PDFs and PDFA, and additionally include LC text data from “How to Train Long-Context Language Models (Effectively)”, which we may refer to as ProLong data. More details can be found in the paper! Training these models effectively is non-trivial. Based on our comprehensive study spanning 50+ ablations, we found several key ingredients for success.

5.1 Train with RAG-Like Hard Negatives

To make our models robust to the noise often found in retrieval pipelines, we trained with "hard negatives": pages that are semantically similar to the target content but irrelevant to the question. This simulates real-world RAG scenarios where relevant information might be found across multiple similar documents or retrieved chunks might be distracting.

How we construct hard negatives:

  • We used an in-house Document Screenshot Embedding (DSE) model to mine these difficult distractors from page embeddings.
  • For each target page, we store the top-128 most similar pages from our corpus as candidates for negative samples.
  • We then construct challenging examples by mixing the relevant page(s) with semantically similar but irrelevant "distractor" pages, or by combining similar pages across multiple documents to simulate RAG retrieval scenarios.
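
A minimal sketch of how this mining and mixing step could look, assuming page embeddings from a DSE-style model are already computed (the function names and example schema are our own illustration, not the exact code used):

```python
import random
import numpy as np

def top_k_distractors(page_embeddings: np.ndarray, target_idx: int,
                      doc_ids: list[str], k: int = 128) -> list[int]:
    """Return the k pages most similar to the target page, excluding pages from the
    same document, so distractors come from other, semantically close documents."""
    emb = page_embeddings / np.linalg.norm(page_embeddings, axis=1, keepdims=True)
    sims = emb @ emb[target_idx]
    same_doc = np.array([d == doc_ids[target_idx] for d in doc_ids])
    sims[same_doc] = -np.inf
    return np.argsort(-sims)[:k].tolist()

def build_training_example(relevant_pages: list, distractors: list,
                           num_distractors: int, rng: random.Random) -> list:
    """Mix the relevant page(s) with sampled distractors and shuffle,
    simulating a noisy retrieved set."""
    pages = list(relevant_pages) + rng.sample(distractors, num_distractors)
    rng.shuffle(pages)
    return pages
```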

Why this matters: Training with hard negatives forces the model to develop fine-grained discrimination abilities. Rather than just "reading," it must actively identify which information is relevant to the question at hand—a critical skill when deployed behind a retrieval system that may surface imperfect results.

Empirical results:

Training Data       | Visual LC Avg | MMLBD-C
With Hard Negatives | 83.2          | 44.0
Documents Only      | 81.6          | 42.4
Improvement         | +1.6          | +1.6

5.2 Page Indices for Navigation

Navigating a long document requires a map. We found that simply prepending explicit page numbers (e.g., "Page 1", "Page 2") to the visual context provides a massive boost, but only if you do it during both training and deployment.

Method: We prepend a minimal text header to each image in the sequence:

```
Page 1:
<image>
Page 2:
<image>
...
```
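
A sketch of how these page headers might be interleaved with images when building the prompt at inference time, here using an OpenAI-style multimodal message format (the exact message schema depends on your serving stack):

```python
import base64

def build_page_indexed_message(page_image_paths: list[str], question: str) -> dict:
    """Interleave a 'Page N:' text header before each page image,
    mirroring the training-time format described above."""
    content = []
    for i, path in enumerate(page_image_paths, start=1):
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "text", "text": f"Page {i}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}
```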

Why it matters: This simple intervention provides explicit positional information that helps the model in several ways:

  • Precise citation: The model can reference specific pages in its answers, making outputs more verifiable and trustworthy.
  • Cross-document reasoning: Users can issue instructions like "Compare the chart on Page 10 with the conclusion on Page 50," and the model can follow them.
  • Structural awareness: Page indices help the model understand document structure and navigate between sections more effectively.

Critical insight: Training is necessary. A key finding from our experiments is that adding page indices only at evaluation time does not improve performance; in fact, it may slightly hurt it (-1.0 points on Visual LC Avg), though the difference is not highly significant. The model must be trained with page indices to benefit from them at inference.

Train | Eval | Visual LC Avg | MMLBD-C
Yes   | Yes  | 83.5          | 45.1
No    | No   | 80.8          | 42.3
No    | Yes  | 79.8          | 42.5

5.3 Match Training Length to Evaluation

Previous wisdom from "How to Train Long-Context Language Models (Effectively)" suggests that training on context lengths significantly longer than the evaluation target is beneficial. Our analysis contradicts this for document understanding—and we discovered why previous work showed different results.

The surprising finding: When we compared a model trained only on our "short stage" (up to 104 pages) versus one trained on both short and long stages (up to 336 pages), the short-stage model actually performed better across the board. This held true across multiple experiments:

  • SFT on Mistral: Short stage outperformed by +3.0 points on Visual LC Avg
  • SFT on Qwen3 VL: Short stage outperformed by +1.4 points on Visual LC Avg
  • LongPO on Qwen3 VL: Short stage outperformed by +2.2 points on Visual LC Avg

Why does this contradict previous work? We reconciled this apparent contradiction by examining the training data distributions. The datasets behind reported benefits from "longer" training, such as ProLong's 512K stage, are actually heavily short-skewed:

  • ProLong 512K: Max 512K tokens, but median only 484 tokens
  • Our long stage: Median 156 images per example (genuinely long)

Previous work wasn't actually training on long contexts: their "long" datasets were mostly short examples with a few long outliers. In contrast, our long stage contained genuinely long documents. When you train on truly long contexts that exceed your evaluation distribution, the model either becomes specialized toward very long documents or suffers from training on more complex and noisier examples.

Practical takeaway: For our evaluation on MMLongBenchDoc, our short stage, with a mean of 21 pages (versus 47 pages for MMLongBenchDoc) and a maximum of ~104 pages, appears to be the sweet spot. This covers the vast majority of reports, contracts, and academic papers. While we did not have time to further explore different or more closely matched length distributions, matching your training distribution to the expected inference length is more effective than indiscriminately maximizing context length.

Model           | Training     | Visual LC Avg | MMLBD-C
Mistral SFT     | Short Stage  | 84.1          | 45.0
Mistral SFT     | Short + Long | 81.1          | 43.3
Qwen3 VL SFT    | Short Stage  | 92.0          | 57.3
Qwen3 VL SFT    | Short + Long | 90.6          | 57.0
Qwen3 VL LongPO | Short Stage  | 94.0          | 56.4
Qwen3 VL LongPO | Short + Long | 91.9          | 54.0

5.4 Self-Improvement via Recursive Generation

One of our key contributions is a novel recursive answer generation pipeline that enables self-improvement—the model can bootstrap its own long-context capabilities without requiring a stronger teacher.

The problem with existing approaches: Most LC VLM training relies on distillation from a more powerful teacher (e.g., Qwen3 VL 235B). Existing preference optimization methods like LongPO default to generating "preferred" responses from the short context where the question originated, treating the rest of the document as irrelevant. This trains the model to look for a small, localized set of relevant pages within the full document rather than considering all of the content.

Our recursive pipeline:

  1. Page-by-page evidence extraction: Given a question, the model extracts evidence relevant to the question from each page individually—a short-context task it already excels at.
  2. Relevance ranking: The extraction model provides a numerical score for each page. We rank pages by relevance.
  3. Answer synthesis: We pass the most relevant pages (or their extracted evidence) to the answer generator, which synthesizes a final response.
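
A condensed sketch of these three steps, assuming a hypothetical `vlm(images=..., prompt=...)` callable and an illustrative "SCORE:" output convention (neither is the exact interface or format we use):

```python
import re

def parse_evidence_and_score(reply: str) -> tuple[str, float]:
    """Illustrative parser: assumes the model ends its reply with a line like 'SCORE: 7'."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    score = float(match.group(1)) if match else 0.0
    evidence = re.sub(r"SCORE:\s*\d+\s*$", "", reply).strip()
    return evidence, score

def recursive_answer(vlm, pages: list, question: str, top_k: int = 8) -> str:
    """Sketch of the recursive pipeline: per-page evidence extraction and scoring,
    relevance ranking, then synthesis over the most relevant pages."""
    scored = []
    for i, page in enumerate(pages):
        # Steps 1-2: extract evidence from a single page and score its relevance,
        # a short-context task the model already handles well.
        reply = vlm(images=[page],
                    prompt=f"Extract any evidence relevant to: {question}\n"
                           "End your reply with 'SCORE: <0-10>' for this page's relevance.")
        evidence, score = parse_evidence_and_score(reply)
        scored.append((score, i, evidence))

    # Step 3: synthesize an answer from the top-ranked pages and their extracted evidence.
    top = sorted(scored, key=lambda t: t[0], reverse=True)[:top_k]
    notes = "\n".join(f"Page {i + 1}: {evidence}" for _, i, evidence in top)
    return vlm(images=[pages[i] for _, i, _ in top],
               prompt=f"Using the evidence notes below, answer the question.\n"
                      f"{notes}\n\nQuestion: {question}")
```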

The key insight is that this distills an algorithm into the model—a systematic search over the full context—rather than just distilling answers from localized subsets. The pipeline searches the entire document for relevant evidence, producing more comprehensive and accurate answers. We also note that CPT is a form of self-improvement, as we generate long samples using Mistral itself and this shows strong performance as well.

Effectiveness: The fact that this pipeline enables strong self-improvement validates its design. These gains indicate the recursive pipeline generates genuinely high-quality training signal, not just noisy approximations.

Method               | Visual LC Avg | MMLBD-C
Mistral Base         | 80.2          | 41.4
+ Self-Improving SFT | 83.4 (+3.2)   | 45.2 (+3.8)
+ Self-Improving CPT | 84.0 (+3.8)   | 42.7 (+1.3)

5.5 Visual Training Improves Text Understanding

Interestingly, we found that training on long visual documents transfers strongly to text-only performance. This extends findings from prior work that showed the reverse (text-to-visual transfer), demonstrating that long-context capabilities are somewhat modality-agnostic.

The experiment: We applied CPT to Mistral without any long-context text data (i.e., excluding the ProLong text corpus) and measured text-only performance on HELMET, a challenging long-context text benchmark.

Results:

  • Before visual CPT: HELMET score of 37.0
  • After visual-only CPT: HELMET score of 48.5 (+11.5 points)

This is a substantial improvement on a purely text benchmark from training on only visual long-context data—the visual training itself strengthens the model's fundamental ability to maintain coherence over long sequences.

Why this matters:

  • Unified long-context capabilities: Visual and text long-context understanding share underlying mechanisms. Training one improves the other.
  • Practical implication: Even if your primary workflow involves text documents, investing in visual long-context training will improve your model's overall long-context capabilities.
  • Data efficiency: For LC text models, traditional long-form data comes from code, books, etc. But visual documents (PDFs) are abundant and highly relevant to enterprise models. This result suggests they can be leveraged to improve text performance as well.

5.6 CPT is Not Always Necessary

The standard recipe for extending context length is CPT on long-form data. In our quest to study the minimal necessary pretraining, we made a surprising finding: SFT alone is competitive with CPT + SFT, meaning CPT is not always necessary. This may be due to Mistral's large RoPE θ, which allows it to adapt easily to longer sequences, but we leave this exploration to future work.

The experiment: We compared two training paths for Mistral:

  1. SFT directly from the Instruct checkpoint (no CPT)
  2. SFT from a checkpoint that had undergone 100B tokens of CPT

Results:

Method       | Visual LC Avg | MMLBD-C | HELMET
SFT Only     | 84.4          | 45.4    | 47.1
SFT from CPT | 84.0          | 45.1    | 52.0

When to use CPT:

  • Skip CPT if: Your base model's context length is sufficient for your target benchmarks, you're compute-constrained, and you primarily care about visual long-document performance.
  • Use CPT if: You need to extend context length beyond the base model's capacity, you want strong text LC performance, or you have ample compute and want the most robust model.

CPT advantages: Even when not strictly necessary, CPT has benefits: the data is extremely scalable (our tasks like Fill-in-Middle and Unshuffle require annotation of only a single page per long-context example, or are entirely programmatic), requires no strong teacher model, and provides the largest gains on text LC benchmarks.
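
As an illustration of how scalable such data can be, here is a sketch of a programmatic Unshuffle example constructor. The prompt wording and example schema are our own illustration, not the exact format used in training.

```python
import random

def make_unshuffle_example(pages: list, rng: random.Random) -> dict:
    """Programmatic CPT example: shuffle the document's pages and ask the model to
    recover the original reading order. No annotation is needed beyond the pages."""
    order = list(range(len(pages)))
    rng.shuffle(order)
    shuffled = [pages[i] for i in order]  # shuffled[j] is original page order[j]
    # Target: for each original position, the (1-indexed) location it ended up at.
    target = ", ".join(str(order.index(pos) + 1) for pos in range(len(pages)))
    return {
        "images": shuffled,
        "prompt": "The pages of this document have been shuffled. "
                  "List the shuffled page positions in their original reading order.",
        "answer": target,
    }
```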

5.7 LongPO: Preference Optimization for Long Context

Beyond SFT, we explored LongPO—a preference optimization method specifically designed for extending short-context capabilities to long-context inputs.

How LongPO works: The core insight is that models typically perform well on short contexts but struggle as context length increases. LongPO exploits this by:

  1. Generate a question from a short subset of the document (e.g., a few relevant pages).
  2. Create a "preferred" response by answering the question given only the short context—where the model excels.
  3. Create a "rejected" response by answering the same question given the full long context—where the model may struggle or hallucinate.
  4. Train the model to prefer the short-context answer even when presented with the full long context.

This teaches the model to maintain its short-context quality when faced with longer inputs, effectively transferring its strong short-context capabilities to long-context scenarios.
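
A minimal sketch of how one such preference pair might be assembled, again assuming a hypothetical `vlm(images=..., prompt=...)` callable (the full LongPO method has additional details; see the paper):

```python
def make_longpo_pair(vlm, document_pages: list, relevant_indices: list[int],
                     question: str) -> dict:
    """Sketch of a LongPO-style preference pair: the 'chosen' answer comes from the
    short context the question was generated from; the 'rejected' answer comes from
    the same model reading the full long document."""
    short_context = [document_pages[i] for i in relevant_indices]

    chosen = vlm(images=short_context, prompt=question)     # short context: model is strong here
    rejected = vlm(images=document_pages, prompt=question)  # full context: may drift or hallucinate

    # The training example always presents the full document, so the model learns
    # to produce short-context-quality answers from long inputs.
    return {"images": document_pages, "prompt": question,
            "chosen": chosen, "rejected": rejected}
```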

Results: LongPO achieved our highest Visual LC Average (94.0), outperforming SFT alone. However, it requires more than 2× the compute of SFT (since you need to process the long context for both the chosen and rejected responses during training). For maximizing MMLBD-C specifically, plain distillation with SFT was slightly more effective.

Method              | Visual LC Avg | MMLBD-C
LongPO              | 94.0          | 56.4
SFT (Plain Distill) | 92.0          | 57.3
Qwen3 VL Base       | 93.7          | 53.8

5.8 Performance

Our recipes achieve State-of-the-Art (SOTA) performance: our LongPO run achieves the highest Visual LC Average, and our SFT model sets the SOTA on MMLBD-C. While no other models at the 24B-parameter scale report results on MMLongBenchDoc or MMLBD-C, our Mistral checkpoints are a significant improvement on the Pareto frontier at this scale.

On the strength of LongPO vs. SFT: our LongPO checkpoint builds on all of our SFT findings (hard negatives, page indices, a strong page distribution, and skipping CPT), culminating in our best-performing recipe, which delivers strong gains on MMLBD-C while not degrading overall performance. We conclude that LongPO is a stronger objective than SFT, but practitioners may still choose SFT under certain constraints: a limited compute budget (LongPO is more than twice as computationally expensive as SFT; see the paper for details), or a willingness to accept a more specialized model (SFT delivers strong improvements in the targeted area, a 3.5-point gain on MMLBD-C, despite a 1.7-point drop in VA). Additionally, while SFT slightly degrades Qwen3 VL 32B overall, CPT + SFT for Mistral is very strong, improving VA by 4.2 points and MMLBD-C by 6 points, suggesting that these methods remain very effective for weaker or smaller models.

Model                             | Visual LC Avg | MMLBD-C | MMLB 128K
Qwen3 VL 235B (Teacher)           | 98.4          | 56.2    | 78.6
LongPO Short Stage (Ours)         | 94.0          | 56.4    | 75.6
Qwen3 VL 32B Instruct (Base)      | 93.7          | 53.8    | 70.4
Qwen3 VL 32B Plain Distill (Ours) | 92.0          | 57.3    | 73.8
Mistral 24B Plain Distill (Ours)  | 84.4          | 47.4    | 65.7
Mistral 3.1 Small (Base)          | 80.2          | 41.4    | 66.4

6 Conclusion

We have presented the first comprehensive study of training long-context visual document models, spanning continued pretraining, supervised finetuning, and preference optimization. Our findings challenge some prevailing assumptions: CPT is not always necessary if the base model is capable, and matching training context length to evaluation length often yields better results than simply training on the longest possible sequences. We also showed that simple, high-utility interventions like explicit page indices provide substantial gains with minimal effort.

Crucially, we demonstrated that visual long-context training is not isolated—it transfers to long-context text performance, suggesting a deeper underlying capability for coherence over long horizons. Our synthetic data pipelines, capable of self-improvement, offer a path forward even without access to the strongest teacher models.

Long-Context VLMs are not a replacement for RAG, but a powerful extension of it. By effectively training models to handle long visual documents, we can offer a "Deep Reader" capability that handles the complex, holistic, and vague queries that traditional retrieval misses. With the release of our recipes and the cleaned MMLBD-C benchmark, we hope to accelerate progress in long-document understanding and help the community build more robust, capable vision-language systems.
