
LightOn Publications

MCPyLate: MCP server using PyLate models for multi-vector search, PLAID.

Antoine Chaffin

Multi-vector search has shown very strong performance compared to single dense vector search in numerous domains, including out-of-domain, long-context and reasoning-intensive retrieval. Multi-vector models are thus particularly well suited for modern retrieval use cases, including agentic workflows. PyLate is a library built on top of sentence-transformers that makes it easy to train and use multi-vector models. This MCP server demonstrates the use of PyLate models alongside PLAID, PyLate's index optimized for multi-vector search.

FastPlaid: A High-Performance Engine for Multi-Vector Search

Raphaël Sourty

Traditional vector search relies on single, fixed-size embeddings (dense vectors) for documents and queries. While powerful, this approach can lose nuanced, token-level details.

  • Multi-vector search, used in models like ColBERT or ColPali, replaces a single document or image vector with a set of per-token vectors. This enables a "late interaction" mechanism, where fine-grained similarity is calculated term-by-term to boost retrieval accuracy.
  • Higher Accuracy: By matching at a granular token level, FastPlaid captures subtle relevance that single-vector models simply miss.
  • PLAID: stands for Performance-optimized Late Interaction Driver.
  • Blazing Performance: Engineered in Rust and optimized for GPUs.
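The late interaction mechanism described in the list above can be illustrated with a minimal sketch. It assumes dot-product similarity and the MaxSim aggregation used by ColBERT-style models; the function names and toy vectors are hypothetical and are not part of the FastPlaid API.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late interaction score: each query token keeps only its best-matching
    document token (max dot product), and those maxima are summed."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy per-token embeddings (illustrative values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has an exact match for each query token
doc_b = [[0.5, 0.5], [0.5, 0.5]]              # only partially matches each query token

scores = {"doc_a": maxsim_score(query, doc_a), "doc_b": maxsim_score(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

A single pooled vector would blur the two query tokens together, whereas the per-token maxima keep each match separate, which is the fine-grained behaviour the list attributes to multi-vector search.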

Reason-ModernColBERT: A late interaction model trained on reasoning tasks

Antoine Chaffin

Reason-ModernColBERT is a late interaction model trained on the reasonir-hq dataset. It achieves extremely competitive performance on the BRIGHT benchmark, which evaluates reasoning-intensive retrieval, outperforming all existing models up to 7B (more than 45 times its size) and, surprisingly, even improving on ReasonIR-8B (an 8B model trained on the same data) by more than 2.5 NDCG@10 points.
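For readers unfamiliar with the metric quoted above, here is a minimal sketch of NDCG@10 for a single query. It assumes the common linear-gain formulation and that scores are reported multiplied by 100, so a gain of 2.5 corresponds to 0.025 on the 0-to-1 scale; the function names and relevance labels are illustrative and are not taken from the BRIGHT evaluation code.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain: relevance at 0-based position i is
    divided by log2(i + 2), so lower-ranked hits count for less."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_rels: Sequence[float], judged_rels: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal
    ordering of all judged documents for the query."""
    ideal = dcg_at_k(sorted(judged_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with two relevant documents.
judged = [1, 1]
perfect = ndcg_at_k([1, 1, 0, 0], judged)    # both relevant documents ranked first
imperfect = ndcg_at_k([0, 1, 0, 1], judged)  # relevant documents at ranks 2 and 4
print(perfect, imperfect)
```

Benchmark-level figures are averages of this per-query value over all queries.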

pylate-rs: A Rust implementation of PyLate

Raphaël Sourty

pylate-rs is a high-performance inference engine for PyLate models, meticulously crafted in Rust for optimal speed and efficiency.

While model training is handled by PyLate, which supports a variety of late interaction models, pylate-rs is engineered to execute these models at high speed.

  • Accelerated Performance: Experience significantly faster model loading and rapid cold starts, making it ideal for serverless environments and low-latency applications.
  • Lightweight Design: Built on the Candle ML framework, pylate-rs maintains a minimal footprint suitable for resource-constrained systems like serverless functions and edge computing.
  • Broad Hardware Support: Optimized for diverse hardware, with dedicated builds for standard CPUs, Intel (MKL), Apple Silicon (Accelerate & Metal), and NVIDIA GPUs (CUDA).
  • Cross-Platform Integration: Seamlessly integrate pylate-rs into your projects with bindings for Python, Rust, and JavaScript/WebAssembly.

For a complete, high-performance multi-vector search pipeline, pair pylate-rs with its companion library, FastPlaid, at inference time.

Explore our WebAssembly live demo.

Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation

Gautier Evennou, Antoine Chaffin, Vivien Chappelier, Ewa Kijak

The rise in quality of generative models in recent years has enabled the generation of edited variations of images at large scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D rendered images, it struggles on real-world images. The reason is twofold: the scarcity of training data, and the difficulty of capturing fine-grained differences between complex images. To address these issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show that it outperforms two-stream approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high-quality data, leading to a challenging new dataset well suited for IDC, named Syned.

PyLate: Flexible Training and Retrieval for Late Interaction Models

Antoine Chaffin, Raphaël Sourty

Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.

BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP

Thomas Sounack, Joshua Davis, Brigitte Durieux, Antoine Chaffin, Tom J. Pollard, Eric Lehman, Alistair E. W. Johnson, Matthew McDermott, Tristan Naumann, Charlotta Lindvall

Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.

[Ettin] Seq vs Seq: An Open Suite of Paired Encoders and Decoders

Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme

The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data ETTIN suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.

[ModernBERT] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

MonoQwen-Vision, the first visual document reranker

Antoine Chaffin, Aurélien Lac

We introduce MonoQwen2-VL-v0.1, the first visual document reranker, designed to enhance the quality of the retrieved visual documents and take these pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.

DuckSearch: Search through Hugging Face datasets

Raphaël Sourty

DuckSearch is a lightweight Python library built on DuckDB, designed for efficient document search and filtering with Hugging Face datasets and standard documents.

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Benjamin Clavié, Antoine Chaffin, Griffin Adams

Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
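The idea of clustering-based token pooling can be sketched as follows. This is a simplified stand-in, not the paper's implementation: it greedily merges the two most cosine-similar clusters until roughly `len(tokens) / pool_factor` clusters remain, then mean-pools each cluster. The name `pool_factor`, the greedy merge rule, and the toy vectors are assumptions made for illustration. Under this scheme a factor of 2 keeps half of the vectors (a 50% reduction), while factors of 3 and 4 correspond to reductions of roughly 66% and 75%.

```python
import math
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def pool_tokens(token_vectors: List[Vector], pool_factor: int = 2) -> List[Vector]:
    """Reduce a document's token vectors by about `pool_factor` at indexing time."""
    target = max(1, math.ceil(len(token_vectors) / pool_factor))
    clusters = [[v] for v in token_vectors]
    while len(clusters) > target:
        best = None  # (similarity, i, j) of the closest pair of clusters so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(mean(clusters[i]), mean(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return [mean(cluster) for cluster in clusters]


# Four toy token vectors forming two near-duplicate pairs.
doc_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pooled = pool_tokens(doc_tokens, pool_factor=2)
print(len(doc_tokens), "->", len(pooled), pooled)
```

Because pooling happens only when documents are indexed, queries are scored against the pooled vectors exactly as before, which is consistent with the abstract's point that no query-time processing is needed.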
