MonoQwen-Vision, the first visual document reranker

Antoine Chaffin, Aurélien Lac

We introduce MonoQwen2-VL-v0.1, the first visual document reranker, designed to improve the quality of retrieved visual documents and take visual retrieval pipelines to the next level. Reranking a small number of candidates with MonoQwen2-VL-v0.1 achieves top results on the ViDoRe leaderboard.

DuckSearch: Search through Hugging Face datasets

Raphaël Sourty

DuckSearch is a lightweight Python library built on DuckDB, designed for efficient document search and filtering with Hugging Face datasets and standard documents.

Reducing the Footprint of Multi-Vector Retrieval with Minimal Performance Impact via Token Pooling

Authors: Benjamin Clavié, Antoine Chaffin, Griffin Adams

Over the last few years, multi-vector retrieval methods, spearheaded by ColBERT, have become an increasingly popular approach to Neural IR. By storing representations at the token level rather than at the document level, these methods have demonstrated very strong retrieval performance, especially in out-of-domain settings. However, the storage and memory needed to hold the large number of associated vectors remain an important drawback, hindering practical adoption. In this paper, we introduce a simple clustering-based token pooling approach to aggressively reduce the number of vectors that need to be stored. This method can reduce the space and memory footprint of ColBERT indexes by 50% with virtually no retrieval performance degradation. It also allows for further reductions, cutting the vector count by 66% to 75%, with degradation remaining below 5% on a vast majority of datasets. Importantly, this approach requires neither architectural changes nor query-time processing, and can be used as a simple drop-in during indexing with any ColBERT-like model.
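
The idea of clustering-based token pooling can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the clustering algorithm, so this example substitutes a naive greedy merge of the two most similar clusters (by cosine similarity of their centroids), and the names `pool_tokens` and `pool_factor` are hypothetical. Under this sketch, a `pool_factor` of 2 keeps about half of the vectors, while 3 and 4 keep about a third and a quarter.

```python
import math


def _mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def _cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pool_tokens(token_vectors, pool_factor=2):
    """Reduce one document's token vectors to about len / pool_factor vectors.

    Illustrative stand-in only: clusters are built by repeatedly merging the
    two clusters whose centroids are most similar, then each cluster is
    replaced by its mean vector.
    """
    target = max(1, len(token_vectors) // pool_factor)
    clusters = [[v] for v in token_vectors]
    while len(clusters) > target:
        centroids = [_mean(c) for c in clusters]
        best = None  # (similarity, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = _cosine(centroids[i], centroids[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        # j > i, so removing cluster j does not shift the position of i.
        clusters[i].extend(clusters.pop(j))
    return [_mean(c) for c in clusters]


# Four toy 2-dimensional "token" vectors forming two obvious groups.
doc = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pooled = pool_tokens(doc, pool_factor=2)
```

Because pooling in this sketch only touches the document vectors before they are stored, nothing changes on the query side, which matches the abstract's point that no query-time processing is required.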

FC-AMF-OCR Dataset: LightOn releases a 9.3 million images OCR dataset to improve real-world document parsing, 2024

Author: Taghadouini Said

With over 9.3 million annotated images, this dataset offers researchers and AI developers a valuable resource for creating models adapted to real-world documents.

PyLate: Flexible Training and Retrieval for ColBERT Models

Authors: Chaffin Antoine, Sourty Raphaël

We release PyLate, a new user-friendly library for training and experimenting with ColBERT models, a family of models that exhibit strong retrieval capabilities on out-of-domain data.

ArabicWeb24: Creating a high quality Arabic Web-only pre-training dataset

Authors: Farhat, May* (LightOn; INSAT); Taghadouini, Said (LightOn); Hallström, Oskar (LightOn); Hajri Gabouj, Sonja (INSAT), 2024

This blog post describes the pre-processing recipe behind the ArabicWeb24 dataset and evaluates it by training several ablation models. It also outlines the impact of the different filtering pipelines on model output and data quality.

Training Mamba Models on AMD MI250/MI250X GPUs with Custom Kernels

Authors: Veselka Austin, Taghadouini Said and Hallström Oskar

In this blog post, we show how to train a Mamba model interchangeably on NVIDIA and AMD GPUs, and we compare training performance and convergence on both. This shows that our training stack is becoming more GPU-agnostic.

LightOn AI Meetup: Creating a Large Dataset for Pretraining LLMs

Author: Guilherme Penedo (Hugging Face)

Passing the Torch: Training a Mamba Model for Smooth Handover

Authors: Hallström, Oskar and Taghadouini, Said and Thiriet, Clément and Chaffin, Antoine

We present our explorations on training language models based on the new Mamba architecture, which deviates from the traditional Transformer architecture.

Summary of LightOn AI meetup #14: WeightWatcher, a Diagnostic Tool for Deep Neural Networks

High Quality data need not apply: training LLMs with web data only

Authors: Julien Launay, Guilherme Penedo, Alessandro Cappelli, Baptiste Pannier, Ruxandra Cojocaru, Ebtesam Almazrouei

4th workshop on Neural Scaling Laws: Towards Maximally Beneficial AGI, NeurIPS 2022 – Machine Learning/NLP – LLMs

Abstract not available.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: Teven Le Scao and 300+ authors

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
