Publications By LightOn
Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling
Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory requirements necessary to store the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. It also allows for further reductions of 66% to 75% of the vector count, with degradation remaining below 5% on the vast majority of datasets. Importantly, this approach requires no architectural change or query-time processing, and can be used as a simple drop-in during indexation with any ColBERT-like model.
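As a rough illustration of the idea, the sketch below clusters a document's token vectors and mean-pools each cluster into a single representative vector; the choice of hierarchical Ward clustering and the pool factor here are assumptions for illustration, not a definitive implementation of the paper's method.

```python
# Minimal sketch of clustering-based token pooling (illustrative; the exact
# clustering algorithm and pool-factor handling are assumptions).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def pool_document_tokens(token_embeddings: np.ndarray, pool_factor: int = 2) -> np.ndarray:
    """Reduce a document's token vectors by mean-pooling clusters of similar tokens.

    token_embeddings: (num_tokens, dim) array of ColBERT-style token vectors.
    pool_factor: 2 keeps ~50% of vectors, 3 keeps ~33%, 4 keeps ~25%.
    """
    num_tokens = token_embeddings.shape[0]
    num_clusters = max(num_tokens // pool_factor, 1)
    if num_clusters >= num_tokens:
        return token_embeddings

    # Hierarchical (Ward) clustering over the token vectors, cut into num_clusters groups.
    cluster_ids = fcluster(
        linkage(token_embeddings, method="ward"), t=num_clusters, criterion="maxclust"
    )

    # Mean-pool the vectors of each cluster into one representative vector.
    return np.stack(
        [token_embeddings[cluster_ids == cid].mean(axis=0) for cid in np.unique(cluster_ids)]
    )
```

Because pooling happens at indexation time only, queries are scored against the pooled index exactly as they would be against the full one.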
FC-AMF-OCR Dataset: LightOn releases a 9.3-million-image OCR dataset to improve real-world document parsing, 2024
With over 9.3 million annotated images, this dataset offers researchers and AI developers a valuable resource for building models adapted to real-world documents.
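For a sense of scale, a dataset of this size is typically consumed in streaming mode rather than downloaded in full; the snippet below is a hedged sketch using the Hugging Face datasets library, and the repository id and field names are assumptions, not the confirmed release details.

```python
# Hedged sketch: streaming a large OCR dataset with Hugging Face `datasets`.
# The repository id below is an assumption and may differ from the actual release.
from datasets import load_dataset

ds = load_dataset("lightonai/fc-amf-ocr", split="train", streaming=True)

# Inspect one annotated sample (the exact schema depends on the release).
sample = next(iter(ds))
print(sample.keys())
```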
PyLate: Flexible Training and Retrieval for ColBERT Models, 2024
We release PyLate, a new user-friendly library for training and experimenting with ColBERT models, a family of models that exhibit strong retrieval capabilities on out-of-domain data.
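For context, ColBERT-style models score a query against a document with late interaction: each query token is matched to its most similar document token and the similarities are summed. The sketch below illustrates that MaxSim scoring rule with generic NumPy arrays; it is not the PyLate API itself.

```python
# Minimal sketch of ColBERT-style late-interaction (MaxSim) scoring,
# shown with generic NumPy arrays rather than the PyLate API.
import numpy as np


def maxsim_score(query_embeddings: np.ndarray, doc_embeddings: np.ndarray) -> float:
    """Sum, over query tokens, of the maximum similarity to any document token.

    query_embeddings: (num_query_tokens, dim), assumed L2-normalized.
    doc_embeddings:   (num_doc_tokens, dim), assumed L2-normalized.
    """
    similarity = query_embeddings @ doc_embeddings.T  # (q_tokens, d_tokens)
    return float(similarity.max(axis=1).sum())
```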
ArabicWeb24: Creating a high-quality Arabic Web-only pre-training dataset, 2024
This blog post discusses the pre-processing recipe behind the ArabicWeb24 dataset and evaluates the process by training different ablation models. It also outlines the impact of the different filtering pipelines on model outputs and data quality.
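The actual filtering rules are described in the blog post; the snippet below is only a hypothetical example of the kind of heuristic document filter such a pipeline might contain, with illustrative threshold values that are not the ones used for ArabicWeb24.

```python
# Hypothetical example of a heuristic filter for an Arabic web-text pipeline.
# The length and Arabic-character-ratio thresholds are illustrative only.
import re

ARABIC_CHARS = re.compile(r"[\u0600-\u06FF]")


def keep_document(text: str, min_chars: int = 200, min_arabic_ratio: float = 0.5) -> bool:
    """Drop documents that are too short or mostly non-Arabic."""
    if len(text) < min_chars:
        return False
    arabic_ratio = len(ARABIC_CHARS.findall(text)) / max(len(text), 1)
    return arabic_ratio >= min_arabic_ratio
```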
Training Mamba Models on AMD MI250/MI250X GPUs with Custom Kernels, 2024
In this blog post, we show how we can train a Mamba model interchangeably on NVIDIA and AMD GPUs, and we compare training performance and convergence in both cases. This shows that our training stack is becoming increasingly GPU-agnostic.
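One reason the same training code can run on both vendors is that PyTorch's ROCm builds expose the familiar torch.cuda interface on AMD GPUs. The snippet below is a generic illustration of that device-agnostic setup, not LightOn's training stack or its custom kernels.

```python
# Generic illustration: PyTorch ROCm builds reuse the torch.cuda namespace,
# so the same device-selection code runs on NVIDIA (CUDA) and AMD (ROCm/HIP).
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    # On ROCm builds torch.version.hip is set; on CUDA builds it is None.
    backend = "ROCm/HIP" if getattr(torch.version, "hip", None) else "CUDA"
    print(f"Training on {torch.cuda.get_device_name(0)} via {backend}")
else:
    print("No GPU available; falling back to CPU")
```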