Publications by LightOn
[Ettin] Seq vs Seq: An Open Suite of Paired Encoders and Decoders
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data ETTIN suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e., a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study, including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.
pylate-rs: A Rust implementation of PyLate
pylate-rs is a high-performance inference engine for PyLate models, meticulously crafted in Rust for optimal speed and efficiency.
While model training is handled by PyLate, which supports a variety of late interaction models, pylate-rs is engineered to execute these models at high speed.
- Accelerated Performance: Experience significantly faster model loading and rapid cold starts, making it ideal for serverless environments and low-latency applications.
- Lightweight Design: Built on the Candle ML framework, pylate-rs maintains a minimal footprint suitable for resource-constrained systems like serverless functions and edge computing.
- Broad Hardware Support: Optimized for diverse hardware, with dedicated builds for standard CPUs, Intel (MKL), Apple Silicon (Accelerate & Metal), and NVIDIA GPUs (CUDA).
- Cross-Platform Integration: Seamlessly integrate pylate-rs into your projects with bindings for Python, Rust, and JavaScript/WebAssembly.
For a complete, high-performance multi-vector search pipeline, pair pylate-rs with its companion library, FastPlaid, at inference time.
Explore our WebAssembly live demo.
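Below is a minimal encoding sketch. It assumes pylate-rs exposes a Python binding that mirrors PyLate's `models.ColBERT` interface; the module path, checkpoint id, and method names are assumptions to verify against the pylate-rs documentation.

```python
# Hypothetical usage sketch: module and method names mirror PyLate's Python API
# and are assumptions, not guaranteed pylate-rs signatures.
from pylate_rs import models

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Late interaction models produce one vector per token, for queries and documents alike.
queries = model.encode(["what is late interaction?"], is_query=True)
documents = model.encode(["Late interaction scores per-token embeddings pairwise."], is_query=False)
```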
BioClinical ModernBERT: A State-of-the-Art Long-Context Encoder for Biomedical and Clinical NLP
Encoder-based transformer models are central to biomedical and clinical Natural Language Processing (NLP), as their bidirectional self-attention makes them well-suited for efficiently extracting structured information from unstructured text through discriminative tasks. However, encoders have seen slower development compared to decoder models, leading to limited domain adaptation in biomedical and clinical settings. We introduce BioClinical ModernBERT, a domain-adapted encoder that builds on the recent ModernBERT release, incorporating long-context processing and substantial improvements in speed and performance for biomedical and clinical NLP. BioClinical ModernBERT is developed through continued pretraining on the largest biomedical and clinical corpus to date, with over 53.5 billion tokens, and addresses a key limitation of prior clinical encoders by leveraging 20 datasets from diverse institutions, domains, and geographic regions, rather than relying on data from a single source. It outperforms existing biomedical and clinical encoders on four downstream tasks spanning a broad range of use cases. We release both base (150M parameters) and large (396M parameters) versions of BioClinical ModernBERT, along with training checkpoints to support further research.
MCPyLate: An MCP server for multi-vector search using PyLate models and the PLAID index
Multi-vector search has shown very strong performance compared to single dense vector search in numerous domains, including out-of-domain, long-context, and reasoning-intensive retrieval. Multi-vector models are thus particularly well suited for modern retrieval use cases, including agentic workflows. PyLate is a library built on top of sentence-transformers that makes it easy to train and use multi-vector models. This MCP server demonstrates the use of PyLate models alongside PLAID, the index optimized for multi-vector search.
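As a rough picture of what the server wraps, here is a retrieval sketch in the style of PyLate's `models`/`indexes`/`retrieve` modules; the constructor arguments and the toy document set are illustrative assumptions, not the server's actual code.

```python
# Sketch of PLAID-backed retrieval with PyLate; argument names follow PyLate's
# documentation as of writing and may differ across versions.
from pylate import indexes, models, retrieve

model = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")
index = indexes.PLAID(index_folder="./plaid-index", index_name="docs", override=True)

# Index per-token document embeddings once, then retrieve for each query.
docs = ["PLAID accelerates late interaction search.", "Single vectors compress meaning."]
index.add_documents(
    documents_ids=["d1", "d2"],
    documents_embeddings=model.encode(docs, is_query=False),
)

retriever = retrieve.ColBERT(index=index)
scores = retriever.retrieve(
    queries_embeddings=model.encode(["fast multi-vector search"], is_query=True),
    k=2,
)
```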
Reason-ModernColBERT: A late interaction model trained for reasoning tasks
Reason-ModernColBERT is a late interaction model trained on the reasonir-hq dataset. It achieves extremely competitive performance on the BRIGHT benchmark, which evaluates reasoning-intensive retrieval, outperforming all existing models up to 7B parameters (more than 45 times its size) and, surprisingly, even beating ReasonIR-8B (an 8B model trained on the same data) by more than 2.5 NDCG@10 points.
FastPlaid: A High-Performance Engine for Multi-Vector Search
Traditional vector search relies on single, fixed-size embeddings (dense vectors) for documents and queries. While powerful, this approach can lose nuanced, token-level details.
- Multi-vector search, used in models like ColBERT or ColPali, replaces a single document or image vector with a set of per-token vectors. This enables a "late interaction" mechanism, where fine-grained similarity is calculated term by term to boost retrieval accuracy (see the MaxSim sketch after this list).
- Higher Accuracy: By matching at a granular token level, FastPlaid captures subtle relevance that single-vector models simply miss.
- PLAID: the Performance-optimized Late Interaction Driver, the multi-vector search engine design that FastPlaid reimplements.
- Blazing Performance: Engineered in Rust and optimized for GPUs.
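For concreteness, the late interaction scoring that FastPlaid accelerates reduces to ColBERT's MaxSim operator: each query token keeps only its best-matching document token, and these per-token maxima are summed. A minimal NumPy sketch of the operator itself (illustrative only, not FastPlaid's Rust internals):

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """ColBERT-style late interaction score.

    query_tokens: (num_query_tokens, dim) and doc_tokens: (num_doc_tokens, dim),
    both assumed L2-normalized so dot products are cosine similarities.
    """
    sim = query_tokens @ doc_tokens.T    # similarity of every query/document token pair
    return float(sim.max(axis=1).sum())  # best document token per query token, summed
```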
[ModernBERT] Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.
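As a usage sketch, ModernBERT loads through the standard Hugging Face masked-language-modeling classes (recent transformers releases include ModernBERT support); the `answerdotai/ModernBERT-base` checkpoint id is the released one, but verify both it and the minimum transformers version against the model card.

```python
# Minimal fill-mask sketch; requires a transformers version with ModernBERT support.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_pretrained("answerdotai/ModernBERT-base")

inputs = tokenizer("Paris is the [MASK] of France.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Decode the highest-scoring token at the mask position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
print(tokenizer.decode(logits[0, mask_pos].argmax()))
```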
MonoQwen-Vision, the first visual document reranker
We introduce MonoQwen2-VL-v0.1, the first visual document reranker, built to enhance the quality of retrieved visual documents and take these retrieval pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.
DuckSearch: Search through Hugging Face datasets
DuckSearch is a lightweight Python library built on DuckDB, designed for efficient document search and filtering with Hugging Face datasets and standard documents.
Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space & memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. This method also allows for further reductions, reducing the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires no architectural change nor query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
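A minimal sketch of the idea: cluster each document's token vectors and mean-pool within clusters, with a "pool factor" controlling how aggressively the count shrinks. Ward hierarchical clustering is one natural choice here, though the paper's exact clustering configuration should be checked before relying on this sketch.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def pool_tokens(doc_embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Shrink a document's token vectors ~pool_factor-fold by clustering and mean-pooling."""
    num_tokens = doc_embeddings.shape[0]
    if num_tokens < 2:
        return doc_embeddings
    target = max(num_tokens // pool_factor, 1)
    # Hierarchical clustering over token embeddings, cut into `target` clusters.
    labels = fcluster(linkage(doc_embeddings, method="ward"), t=target, criterion="maxclust")
    pooled = np.stack([doc_embeddings[labels == c].mean(axis=0) for c in np.unique(labels)])
    # Re-normalize so downstream MaxSim still operates on unit vectors.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```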
FC-AMF-OCR Dataset: LightOn releases a 9.3-million-image OCR dataset to improve real-world document parsing, 2024
With over 9.3 million annotated images, this dataset offers researchers and AI developers a valuable resource for creating models adapted to real-world documents.
PyLate: Flexible Training and Retrieval for Late Interaction Models
Neural ranking has become a cornerstone of modern information retrieval. While single vector search remains the dominant paradigm, it suffers from the shortcoming of compressing all the information into a single vector. This compression leads to notable performance degradation in out-of-domain, long-context, and reasoning-intensive retrieval tasks. Multi-vector approaches pioneered by ColBERT aim to address these limitations by preserving individual token embeddings and computing similarity via the MaxSim operator. This architecture has demonstrated superior empirical advantages, including enhanced out-of-domain generalization, long-context handling, and performance in complex retrieval scenarios. Despite these compelling empirical results and clear theoretical advantages, the practical adoption and public availability of late interaction models remain low compared to their single-vector counterparts, primarily due to a lack of accessible and modular tools for training and experimenting with such models. To bridge this gap, we introduce PyLate, a streamlined library built on top of Sentence Transformers to support multi-vector architectures natively, inheriting its efficient training, advanced logging, and automated model card generation while requiring minimal code changes to code templates users are already familiar with. By offering multi-vector-specific features such as efficient indexes, PyLate aims to accelerate research and real-world application of late interaction models, thereby unlocking their full potential in modern IR systems. Finally, PyLate has already enabled the development of state-of-the-art models, including GTE-ModernColBERT and Reason-ModernColBERT, demonstrating its practical utility for both research and production environments.
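To illustrate the "minimal code changes" claim, here is a training sketch in PyLate's style: the familiar Sentence Transformers trainer is paired with PyLate's ColBERT model class and contrastive loss. The dataset id and training arguments are illustrative placeholders; consult PyLate's documentation for a complete recipe.

```python
# Sketch of PyLate training on top of Sentence Transformers; the dataset and
# hyperparameters are illustrative placeholders.
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from pylate import losses, models

model = models.ColBERT(model_name_or_path="answerdotai/ModernBERT-base")
train_dataset = load_dataset("sentence-transformers/natural-questions", split="train")

trainer = SentenceTransformerTrainer(
    model=model,
    args=SentenceTransformerTrainingArguments(output_dir="./colbert-run", num_train_epochs=1),
    train_dataset=train_dataset,
    loss=losses.Contrastive(model=model),
)
trainer.train()
```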
PyLate: Flexible Training and Retrieval for ColBERT Models
We release PyLate, a new user-friendly library for training and experimenting with ColBERT models, a family of models that exhibit strong retrieval capabilities on out-of-domain data.
ArabicWeb24: Creating a high quality Arabic Web-only pre-training dataset
This blog post discusses the pre-processing recipe behind the ArabicWeb24 dataset and evaluates the process by training different ablation models. It also outlines the impact of the different filtering pipelines on model output and data quality.
Training Mamba Models on AMD MI250/MI250X GPUs with Custom Kernels
In this blog post, we show how to train a Mamba model interchangeably on both NVIDIA and AMD GPUs, and we compare training performance and convergence in both cases. This shows that our training stack is becoming more GPU-agnostic.
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Passing the Torch: Training a Mamba Model for Smooth Handover
We present our explorations on training language models based on the new Mamba architecture, which deviates from the traditional Transformer architecture.
LightOn AI Meetup: Creating a Large Dataset for Pretraining LLMs
Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
The rise in generative model quality over the past few years has enabled the generation of edited variations of images at scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D-rendered images, it struggles on real-world images. The reason is twofold: training-data scarcity, and the difficulty of capturing fine-grained differences between complex images. To address these issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show it outperforms two-stream approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high-quality data, leading to a challenging new dataset well suited for IDC, named Syned1.
Scaling Laws Beyond Backpropagation
NeurIPS 2022 – Workshop: I Can’t Believe It’s Not Better, December 2022
Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment (DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding is at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
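For readers unfamiliar with DFA: instead of backpropagating the error layer by layer through transposed weights, each hidden layer receives the output error through its own fixed random feedback matrix. A toy NumPy sketch of one update on an MLP (illustrative only; the paper studies DFA on causal decoder-only Transformers):

```python
import numpy as np

rng = np.random.default_rng(0)
dims = [32, 64, 64, 10]  # tiny MLP for illustration
W = [rng.normal(0, 0.1, (dims[i], dims[i + 1])) for i in range(3)]
# Fixed random feedback matrices project the output error to each hidden layer.
B = [rng.normal(0, 0.1, (dims[-1], dims[i + 1])) for i in range(2)]

def dfa_step(x, y, lr=1e-2):
    # Forward pass: tanh hidden layers, linear output, squared-error loss.
    h = [x]
    for i, w in enumerate(W):
        z = h[-1] @ w
        h.append(np.tanh(z) if i < len(W) - 1 else z)
    e = h[-1] - y  # output error
    for i in range(len(W)):
        # Hidden layers get the error through random B[i]; no backward chain needed.
        delta = e if i == len(W) - 1 else (e @ B[i]) * (1 - h[i + 1] ** 2)
        W[i] -= lr * h[i].T @ delta / len(x)
    return 0.5 * float((e ** 2).mean())

loss = dfa_step(rng.normal(size=(8, 32)), rng.normal(size=(8, 10)))
```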
High Quality data need not apply: training LLMs with web data only
4th Workshop on Neural Scaling Laws: Towards Maximally Beneficial AGI, NeurIPS 2022 – Machine Learning/NLP – LLMs
Abstract not available.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
What Language Model to Train if You Have One Million GPU Hours?
ACL 2022 Workshop BigScience – LLMs – April 2022
As the size of language models continues to grow, they become increasingly more powerful and lead to better results, but they also become more expensive to design and train. Given a compute budget sufficient to train a multilingual transformer language model at the 100B+ parameter scale, our goal is to choose the architecture and the training setup of such a model. Specifically, we perform an ablation study comparing different modelling architectures, which can significantly impact the performance of the resulting models. We focus on the 1.3B parameter scale, providing a compromise between the compute cost of the architecture search and the probability that our conclusions hold for the target 100B+ model. In addition, we study the impact of various popular pretraining corpora on the quality of the model. We also study the performance of training a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of transformer models to choose the target model size, its shape, and its training setup.
Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances
Technical Reports and Preprints – Machine Learning
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
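For reference, the standard Sliced-Wasserstein distance averages one-dimensional Wasserstein distances between projected measures under the uniform measure on the sphere, and the adaptive variant replaces that uniform measure with an arbitrary slice distribution. The notation below is a standard rendering (with the projection onto direction theta written as a pushforward), not copied from the paper:

```latex
% Standard SW: uniform slice measure \sigma on the sphere \mathbb{S}^{d-1},
% where \theta^{\star}(x) = \langle \theta, x \rangle projects onto direction \theta:
\mathrm{SW}_p^p(\mu, \nu)
  = \int_{\mathbb{S}^{d-1}} W_p^p\big(\theta^{\star}_{\#}\mu,\ \theta^{\star}_{\#}\nu\big)\, \mathrm{d}\sigma(\theta)

% Adaptive SW: the slice distribution \rho is arbitrary, and can be learned by
% optimizing the paper's PAC-Bayesian bounds:
\mathrm{ASW}_p^p(\mu, \nu; \rho)
  = \mathbb{E}_{\theta \sim \rho}\!\left[ W_p^p\big(\theta^{\star}_{\#}\mu,\ \theta^{\star}_{\#}\nu\big) \right]
```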
RITA: a Study on Scaling Up Generative Protein Sequence Models
Technical Reports and Preprints – Machine Learning, LLMs for Biology
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.
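A minimal generation sketch, assuming the RITA checkpoints published on the Hugging Face Hub (e.g. `lightonai/RITA_s`) and that they load through the standard causal-LM classes with custom code enabled; the repo id and the `trust_remote_code` requirement are assumptions to verify against the model card.

```python
# Hypothetical sampling sketch for a RITA protein language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lightonai/RITA_s", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("lightonai/RITA_s", trust_remote_code=True)

# Seed with a few amino acids and sample a continuation.
inputs = tokenizer("MKTVRQ", return_tensors="pt")
out = model.generate(inputs["input_ids"], max_length=60, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0]))
```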
A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model
ACL 2022 Workshop BigScience – LLMs – April 2022
As ever-larger language models grow more ubiquitous, it is crucial to consider their environmental impact. Characterised by extreme size and resource use, recent generations of models have been criticised for their voracious appetite for compute, and thus significant carbon footprint. Although reporting of carbon impact has grown more common in machine learning papers, this reporting is usually limited to compute resources used strictly for training. In this work, we propose a holistic assessment of the footprint of an extreme-scale language model, Noor. Noor is an ongoing project aiming to develop the largest multi-task Arabic language models, with up to 13B parameters, leveraging zero-shot generalisation to enable a wide range of downstream tasks via natural language instructions. We assess the total carbon bill of the entire project: starting with data collection and storage costs, including research and development budgets, pretraining costs, future serving estimates, and other exogenous costs necessary for this international cooperation. Notably, we find that inference costs and exogenous factors can have a significant impact on the total budget. Finally, we discuss pathways to reduce the carbon footprint of extreme-scale models.
Binarization for Optical Processing Units via REINFORCE
Conference proceedings – Machine Learning – November 2021
Optical Processing Units (OPUs) are computing devices that perform random projections of input vectors by exploiting the physical phenomenon of scattering a light source through an opaque medium. OPUs have successfully been proposed to carry out approximate kernel ridge regression at scale and with low power consumption by means of optical random features. OPUs require input vectors to be binary, and this work proposes a novel way to perform supervised data binarization. The main difficulty in developing a solution is that the OPU projection matrices are unknown, which poses a challenge for deriving a binarization approach in an end-to-end fashion. Our approach is based on the REINFORCE gradient estimator, which allows us to estimate the gradient of the loss function with respect to binarization parameters by treating the OPU as a black box. Through experiments on several UCI classification and regression problems, we show that our method outperforms alternative unsupervised and supervised binarization techniques.
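A toy sketch of the estimator (not the paper's exact parameterization): per-feature thresholds define Bernoulli probabilities for each bit, the OPU-plus-task loss is queried as a black box, and the score-function (REINFORCE) estimator supplies the gradient without ever differentiating through the OPU.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(x, loss_fn, theta, lr=0.1):
    """One REINFORCE update of per-feature binarization thresholds theta.

    loss_fn is treated as a black box (OPU projection + downstream loss),
    so only its value is needed, never its gradient.
    """
    p = 1.0 / (1.0 + np.exp(-(x - theta)))       # P(bit = 1) for each feature
    b = (rng.random(x.shape) < p).astype(float)  # sampled binary code fed to the OPU
    loss = loss_fn(b)
    # Score function: d/dtheta log Bernoulli(b; sigmoid(x - theta)) = -(b - p)
    theta -= lr * loss * (-(b - p)).mean(axis=0)
    return theta, loss

# Hypothetical black-box loss standing in for the OPU + task pipeline.
theta = np.zeros(16)
x = rng.normal(size=(32, 16))
theta, loss = reinforce_step(x, lambda b: float((b.mean() - 0.5) ** 2), theta)
```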
PAGnol: An Extra-Large French Generative Model
LREC 2022 – LLMs – Initially published: October 2021
Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models. For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing them to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large size are made publicly available.
Summary of LightOn AI meetup #14: WeightWatcher, a Diagnostic Tool for Deep Neural Networks
A high-fidelity and large-scale reconfigurable photonic processor for NISQ applications
Technical Reports and Preprints – Machine Learning
Reconfigurable linear optical networks are a key component for the development of optical quantum information processing platforms in the NISQ era and beyond. We report the implementation of such a device based on an innovative design that uses the mode mixing of a multimode fiber in combination with the programmable wavefront shaping of a spatial light modulator (SLM). The capabilities of the platform are explored in the classical regime. For up to a record number of 8 inputs and 38 outputs, we achieve fidelities in excess of 93%, week-long stability, and losses below 6.5 dB. The device was built inside a standard server rack to allow for real-world use.