Publications | LightOn

FC-AMF-OCR Dataset : LightOn releases a 9.3 million images OCR dataset to improve real world document parsing, 2024

Author: Taghadouini Said

With over 9.3 million annotated images, this dataset offers researchers and AI developers a valuable resource for creating models adapted to real-world documents.

PyLate: Flexible Training and Retrieval for ColBERT Models

Authors: Chaffin Antoine, Sourty Raphaël

We release PyLate, a new user-friendly library for training and experimenting with ColBERT models, a family of models that exhibit strong retrieval capabilities on out-of-domain data.

ArabicWeb24: Creating a high quality Arabic Web-only pre-training dataset

Authors: May Farhat* (LightOn; INSAT), Said Taghadouini (LightOn), Oskar Hallström (LightOn), Sonja Hajri Gabouj (INSAT), 2024

This blog post describes the pre-processing recipe for the ArabicWeb24 dataset and evaluates it by training several ablation models. It also outlines the impact of the different filtering pipelines on model output and data quality.

Training Mamba Models on AMD MI250/MI250X GPUs with Custom Kernels

Authors: Veselka Austin, Taghadouini Said and Hallström Oskar

In this blog post, we show that a Mamba model can be trained interchangeably on NVIDIA and AMD GPUs, and we compare training performance and convergence on both. This shows that our training stack is becoming more GPU-agnostic.

LightOn AI Meetup: Creating a Large Dataset for Pretraining LLMs

Author: Guilherme Penedo (Hugging Face)

Passing the Torch: Training a Mamba Model for Smooth Handover

Authors: Oskar Hallström, Said Taghadouini, Clément Thiriet, Antoine Chaffin

We present our explorations on training language models based on the new Mamba architecture, which deviates from the traditional Transformer architecture.

Summary of LightOn AI meetup #14: WeightWatcher, a Diagnostic Tool for Deep Neural Networks

High Quality data need not apply: training LLMs with web data only

Authors: Julien Launay, Guilherme Penedo, Alessandro Cappelli, Baptiste Pannier, Ruxandra Cojocaru, Ebtesam Almazrouei

4th Workshop on Neural Scaling Laws: Towards Maximally Beneficial AGI, NeurIPS 2022 – Machine Learning/NLP – LLMs
Abstract not available.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Authors: Teven Le Scao and 300+ authors

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Authors: Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

RITA: a Study on Scaling Up Generative Protein Sequence Models

Authors: Daniel Hesslow, Niccoló Zanichelli, Pascal Notin, Iacopo Poli, Debora Marks

Technical Reports and Preprints – Machine Learning, LLMs for Biology
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.

A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model

Authors: Imad Lakim, Ebtesam Almazrouei, Merouane Debbah, Julien Launay

ACL 2022 Workshop BigScience – LLMs – April 2022
As ever-larger language models grow more ubiquitous, it is crucial to consider their environmental impact. Characterised by extreme size and resource use, recent generations of models have been criticised for their voracious appetite for compute, and thus significant carbon footprint. Although reporting of carbon impact has grown more common in machine learning papers, this reporting is usually limited to compute resources used strictly for training. In this work, we propose a holistic assessment of the footprint of an extreme-scale language model, Noor. Noor is an ongoing project aiming to develop the largest multi-task Arabic language models (with up to 13B parameters), leveraging zero-shot generalisation to enable a wide range of downstream tasks via natural language instructions. We assess the total carbon bill of the entire project: starting with data collection and storage costs, including research and development budgets, pretraining costs, future serving estimates, and other exogenous costs necessary for this international cooperation. Notably, we find that inference costs and exogenous factors can have a significant impact on the total budget. Finally, we discuss pathways to reduce the carbon footprint of extreme-scale models.
