Publications by LightOn

Passing the Torch: Training a Mamba Model for Smooth Handover

We experiment with the Warmup-Stable-Decay (WSD) learning rate scheduler and a novel positional weighting of the loss for language model pre-training; We find that WSD outperforms the cosine scheduler, and positional weighting results in better top k accuracy. We conduct the experiments using the Mamba architecture, which with its linear complexity achieves substantially higher throughput at inference than transformers. Finally, based on our experiments we train Mambaoutai, a 1.6B parameters model on 300B tokens. The training dataset comprises mainly French, English and code from various programming languages. Over 80 checkpoints of Mambaoutai 1.6B are released for the ML community to explore here, which thanks to the WSD scheduler can be further pre-trained smoothly without any degradation.

High Quality data need not apply: training LLMs with web data only

4th workshop on Neural Scaling Laws: Towards Maximally Beneficial AGI, NeurIPS 2022 – Machine Learning/NLP – LLMs
Abstract not available.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Machine Learning/NLP  – LLMs

Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.

What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?

Machine Learning/NLP – LLMs

Large pretrained Transformer language models have been shown to exhibit zero-shot generalization, i.e. they can perform a wide variety of tasks that they were not explicitly trained on. However, the architectures and pretraining objectives used across state-of-the-art models differ significantly, and there has been limited systematic comparison of these factors. In this work, we present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. In particular, we focus on text-to-text models and experiment with three model architectures (causal/non-causal decoder-only and encoder-decoder), trained with two different pretraining objectives (autoregressive and masked language modeling), and evaluated with and without multitask prompted finetuning. We train models with over 5 billion parameters for more than 170 billion tokens, thereby increasing the likelihood that our conclusions will transfer to even larger scales. Our experiments show that causal decoder-only models trained on an autoregressive language modeling objective exhibit the strongest zero-shot generalization after purely unsupervised pretraining. However, models with non-causal visibility on their input trained with a masked language modeling objective followed by multitask finetuning perform the best among our experiments. We therefore consider the adaptation of pretrained models across architectures and objectives. We find that pretrained non-causal decoder models can be adapted into performant generative causal decoder models, using autoregressive language modeling as a downstream task. Furthermore, we find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models, ultimately achieving competitive performance after multitask finetuning. Code and checkpoints are available at this https URL.

RITA: a Study on Scaling Up Generative Protein Sequence Models

Technical Reports and Preprints – Machine Learning, LLMs for Biology
In this work we introduce RITA: a suite of autoregressive generative models for protein sequences, with up to 1.2 billion parameters, trained on over 280 million protein sequences belonging to the UniRef-100 database. Such generative models hold the promise of greatly accelerating protein design. We conduct the first systematic study of how capabilities evolve with model size for autoregressive transformers in the protein domain: we evaluate RITA models in next amino acid prediction, zero-shot fitness, and enzyme function prediction, showing benefits from increased scale. We release the RITA models openly, to the benefit of the research community.

A Holistic Assessment of the Carbon Footprint of Noor, a Very Large Arabic Language Model

ACL 2022 Workshop BigScience – LLMs – April 2022
As ever-larger language models grow more ubiquitous, it is crucial to consider their environmental impact. Characterised by extreme size and resource use, recent generations of models have been criticised for their voracious appetite for compute, and thus significant carbon footprint. Although reporting of carbon impact has grown more common in machine learning papers, this reporting is usually limited to compute resources used strictly for training. In this work, we propose a holistic assessment of the footprint of an extreme-scale language model, Noor. Noor is an ongoing project aiming to develop the largest multi-task Arabic language models–with up to 13B parameters–leveraging zero-shot generalisation to enable a wide range of downstream tasks via natural language instructions. We assess the total carbon bill of the entire project: starting with data collection and storage costs, including research and development budgets, pretraining costs, future serving estimates, and other exogenous costs necessary for this international cooperation. Notably, we find that inference costs and exogenous factors can have a significant impact on the total budget. Finally, we discuss pathways to reduce the carbon footprint of extreme-scale models.

What Language Model to Train if You Have One Million GPU Hours?

ACL 2022 Workshop BigScience – LLMs – April 2022
As the size of language models continues to grow they become increasingly more powerful and lead to better results, but they also become more expensive to design and train. Given a compute budget that’s enough to train a multilingual transformers language model in the 100B+ parameter scale, our goal is to choose the architecture and the training setup of such a model. Specifically, we perform an ablation study comparing different modelling architectural, which can significantly impact the performance of the resulting models. We focus on the 1.3B parameter scale providing a compromise between the compute cost of the architecture search and the probability that our conclusions hold for the target 100B+ model. In addition, we study the impact of various popular pretraining corpora on the quality of the model. We also study the performance of training a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of transformer models to choose the target model size, its shape and its training setup.

PAGnol: An Extra-Large French Generative Model

LREC 2022 – LLMs – Initially published: October 2021
Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and performing versions of PAGnol, exploring the capabilities of French extreme-scale models.
For this first release, we focus on the pre-training and scaling calculations underlining PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing them to other state-of-the-art French and multilingual models, and reaching the state of the art in the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large are made publicly available.

Scaling Laws Beyond Backpropagation

NeurIPS 2022 – Workshop: I Can’t Believe It’s Not Better, December 2022
Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment~(DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding comes at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.

Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances

Technical Reports and Preprints – Machine Learning
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.

A high-fidelity and large-scale reconfigurable photonic processor for NISQ applications

Technical Reports and Preprints – Machine Learning
Reconfigurable linear optical networks are a key component for the development of optical quantum information processing platforms in the NISQ era and beyond. We report the implementation of such a device based on an innovative design that uses the mode mixing of a multimode fiber in combination with the programmable wavefront shaping of an SLM. The capabilities of the platform are explored in the classical regime. For up to a record number of 8~inputs and 38~outputs we achieve fidelities in excess of 93%, week-long stability and losses below 6.5dB. The device was built inside a standard server rack to allow for real-world use.

Binarization for Optical Processing Units via REINFORCE

Conference proceedings – Machine Learning – November 2021
Optical Processing Units (OPUs) are computing devices that perform random projections of input vectors by exploiting the physical phenomenon of scattering a light source through an opaque medium. OPUs have successfully been proposed to carry out approximate kernel ridge regression at scale and with low power consumption by the means of optical random features. OPUs require input vectors to be binary, and this work proposes a novel way to perform supervised data binarization. The main difficulty to develop a solution is that the OPU projection matrices are unknown which poses a challenge in deriving a binarization approach in an end-to-end fashion. Our approach is based on the REINFORCE gradient estimator, which allows us to estimate the gradient of the loss function with respect to binarization parameters by treating the OPU as a black box. Through experiments on several UCI classification and regression problems, we show that our method outperforms alternative unsupervised and supervised binarization techniques.

Photonic co-processors in HPC: using LightOn OPUs for Randomized Numerical Linear Algebra

Conference Proceedings – Randomized Numerical Linear Algebra – July 2021
Randomized Numerical Linear Algebra (RandNLA) is a powerful class of methods, widely used in High Performance Computing (HPC). RandNLA provides approximate solutions to linear algebra functions applied to large signals, at reduced computational costs. However, the randomization step for dimensionality reduction may itself become the computational bottleneck on traditional hardware. Leveraging near constant-time linear random projections delivered by LightOn Optical Processing Units we show that randomization can be significantly accelerated, at a negligible precision loss, in a wide range of important RandNLA algorithms, such as RandSVD or trace estimators.

LightOn Optical Processing Unit: Scaling-up AI and HPC with a Non von Neumann co-processor

Conference Proceedings – Randomized Numerical Linear Algebra / Hardware – July 2021
We introduce LightOn’s Optical Processing Unit (OPU), the first photonic AI accelerator chip available on the market for at-scale Non von Neumann computations, reaching 1500 TeraOPS. It relies on a combination of free-space optics with off-the-shelf components, together with a software API allowing a seamless integration within Python-based processing pipelines. We discuss a variety of use cases and hybrid network architectures, with the OPU used in combination with CPU/GPU, and draw a pathway towards “optical advantage”.

Photonic Differential Privacy with Direct Feedback Alignment

Technical Reports and Preprints – Robust Machine Learning – June 2021
Optical Processing Units (OPUs) — low-power photonic chips dedicated to large scale random projections — have been used in previous work to train deep neural networks using Direct Feedback Alignment (DFA), an effective alternative to backpropagation. Here, we demonstrate how to leverage the intrinsic noise of optical random projections to build a differentially private DFA mechanism, making OPUs a solution of choice to provide a private-by-design training. We provide a theoretical analysis of our adaptive privacy mechanism, carefully measuring how the noise of optical random projections propagates in the process and gives rise to provable Differential Privacy. Finally, we conduct experiments demonstrating the ability of our learning procedure to achieve solid end-task performance.

Experimental Approach to Demonstrating Contextuality for Qudits

Technical Reports and Preprints – Quantum
We propose a method to experimentally demonstrate contextuality with a family of tests for qudits. The experiment we propose uses a qudit encoded in the path of a single photon and its temporal degrees of freedom. We consider the impact of noise on the effectiveness of these tests, taking the approach of ontologically faithful non-contextuality. In this approach, imperfections in the experimental setup must be taken into account in any faithful ontological (classical) model, which limits how much the statistics can deviate within different contexts. In this way, we bound the precision of the experimental setup under which ontologically faithful non-contextual models can be refuted. We further consider the noise tolerance through different types of decoherence models on different types of encodings of qudits. We quantify the effect of the decoherence on the required precision for the experimental setup in order to demonstrate contextuality in this broader sense.

Contrastive Embeddings for Neural Architectures

Technical Reports and Preprints – Neural Architecture Search
The performance of algorithms for neural architecture search strongly depends on the parametrization of the search space. We use contrastive learning to identify networks across different initializations based on their data Jacobians, and automatically produce the first architecture embeddings independent from the parametrization of the search space. Using our contrastive embeddings, we show that traditional black-box optimization algorithms, without modification, can reach state-of-the-art performance in Neural Architecture Search. As our method provides a unified embedding space, we perform for the first time transfer learning between search spaces. Finally, we show the evolution of embeddings during training, motivating future studies into using embeddings at different training stages to gain a deeper understanding of the networks in a search space.

Adversarial Robustness by Design through Analog Computing and Synthetic Gradients

Technical Reports and Preprints – Robust Machine Learning
We propose a new defense mechanism against adversarial attacks inspired by an optical co-processor, providing robustness without compromising natural accuracy in both white-box and black-box settings. This hardware co-processor performs a nonlinear fixed random transformation, where the parameters are unknown and impossible to retrieve with sufficient precision for large enough dimensions. In the white-box setting, our defense works by obfuscating the parameters of the random projection. Unlike other defenses relying on obfuscated gradients, we find we are unable to build a reliable backward differentiable approximation for obfuscated parameters. Moreover, while our model reaches a good natural accuracy with a hybrid backpropagation – synthetic gradient method, the same approach is suboptimal if employed to generate adversarial examples. We find the combination of a random projection and binarization in the optical system also improves robustness against various types of black-box attacks. Finally, our hybrid training method builds robust features against transfer attacks. We demonstrate our approach on a VGG-like architecture, placing the defense on top of the convolutional features, on CIFAR-10 and CIFAR-100. Code is available at this https URL.

Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures

Conference Proceedings – Machine Learning/ Deep Learning
Despite being the workhorse of deep learning, the backpropagation algorithm is no panacea. It enforces sequential layer updates, thus preventing efficient parallelization of the training process. Furthermore, its biological plausibility is being challenged. Alternative schemes have been devised; yet, under the constraint of synaptic asymmetry, none have scaled to modern deep learning tasks and architectures. Here, we challenge this perspective and study the applicability of Direct Feedback Alignment to neural view synthesis, recommender systems, geometric learning, and natural language processing.

Reservoir Computing meets Recurrent Kernels and Structured Transforms

Conference Proceedings – Machine Learning Techniques
Reservoir Computing is a class of simple yet efficient Recurrent Neural Networks where internal weights are fixed at random and only a linear output layer is trained. In the large size limit, such random neural networks have a deep connection with kernel methods. Our contributions are threefold: a) We rigorously establish the recurrent kernel limit of Reservoir Computing and prove its convergence. b) We test our models on chaotic time series prediction, a classic but challenging benchmark in Reservoir Computing, and show how the Recurrent Kernel is competitive and computationally efficient when the number of data points remains moderate. c) When the number of samples is too large, we leverage the success of structured Random Features for kernel approximation by introducing Structured Reservoir Computing. The two proposed methods, Recurrent Kernel and Structured Reservoir Computing, turn out to be much faster and more memory-efficient than conventional Reservoir Computing.

The dynamics of learning with feedback alignment

Technical Reports and Preprints – Machine Learning / Deep Learning
Direct Feedback Alignment (DFA) is emerging as an efficient and biologically plausible alternative to the ubiquitous backpropagation algorithm for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as Transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA to explain these diverging results remains elusive. Here, we propose a theory for the success of DFA. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorize process occurs sequentially from the bottom layers of the network to the top.

Light-in-the-loop: using a photonics co-processor for scalable training of neural networks

Conference Proceedings – Hardware
As neural networks grow larger and more complex and data-hungry, training costs are skyrocketing. Especially when lifelong learning is necessary, such as in recommender systems or self-driving cars, this might soon become unsustainable. In this study, we present the first optical co-processor able to accelerate the training phase of digitally implemented neural networks.

NEWMA: a new method for scalable model-free online change-point detection

Journal Papers – Time Series
We consider the problem of detecting abrupt changes in the distribution of a multi-dimensional time series, with limited computing power and memory. In this paper, we propose a new, simple method for model-free online change-point detection that relies only on fast and light recursive statistics, inspired by the classical Exponential Weighted Moving Average algorithm (EWMA).

Kernel computations from large-scale random features obtained by Optical Processing Units

Conference Proceedings – Computer Vision
Approximating kernel functions with random features (RFs) has been a successful application of random projections for nonparametric estimation. However, performing random projections presents computational challenges for large-scale problems. Recently, a new optical hardware called Optical Processing Unit (OPU) has been developed for fast and energy-efficient computation of large-scale RFs in the analog domain.

Fast Optical System Identification by Numerical Interferometry

Conference Proceedings – Signal Processing
We propose a numerical interferometry method for identification of optical multiply-scattering systems when only intensity can be measured. Our method simplifies the calibration of optical transmission matrices from a quadratic to a linear inverse problem by first recovering the phase of the measurements. We show that by carefully designing the probing signals, measurement phase retrieval amounts to a distance geometry problem—a multilateration—in the complex plane.

Online Change Point Detection in Molecular Dynamics With Optical Random Features

Technical Reports and Preprints – Time Series
Proteins are made of atoms constantly fluctuating, but can occasionally undergo large-scale changes. Such transitions are of biological interest, linking the structure of a protein to its function with a cell. Atomic-level simulations, such as Molecular Dynamics (MD), are used to study these events. However, molecular dynamics simulations produce time series with multiple observables, while changes often only affect a few of them.

Optical Reservoir Computing using multiple light scattering for chaotic systems prediction

Journal Papers – Time Series
Reservoir Computing is a relatively recent computational framework based on a large Recurrent Neural Network with fixed weights. Many physical implementations of Reservoir Computing have been proposed to improve speed and energy efficiency. In this study, we report new advances in Optical Reservoir Computing using multiple light scattering to accelerate the recursive computation of the reservoir states.

Don’t take it lightly: Phasing optical random projections with unknown operators

Conference Proceeding – Signal Processing
In this paper we tackle the problem of recovering the phase of complex linear measurements when only magnitude information is available and we control the input. We are motivated by the recent development of dedicated optics-based hardware for rapid random projections which leverages the propagation of light in random media.

Principled Training of Neural Networks with Direct Feedback Alignment

Technical Reports and Preprints – Machine Learning Techniques
The backpropagation algorithm has long been the canonical training method for neural networks. Modern paradigms are implicitly optimized for it, and numerous guidelines exist to ensure its proper use. Recently, synthetic gradients methods -where the error gradient is only roughly approximated – have garnered interest. These methods not only better portray how biological brains are learning, but also open new computational possibilities, such as updating layers asynchronously.

Machine learning and the physical sciences

Journal Papers – Quantum Physics
Machine learning (ML) encompasses a broad range of algorithms and modeling tools used for a vast array of data processing tasks, which has entered most scientific disciplines in recent years. This article reviews in a selective way the recent research on the interface between machine learning and the physical sciences.

Scaling up Echo-State Networks with multiple light scattering

Conference Proceedings – Time Series
Echo-State Networks and Reservoir Computing have been studied for more than a decade. They provide a simpler yet powerful alternative to Recurrent Neural Networks, every internal weight is fixed and only the last linear layer is trained. They involve many multiplications by dense random matrices. Very large networks are difficult to obtain, as the complexity scales quadratically both in time and memory.

Random Projections through multiple optical scattering: Approximating kernels at the speed of light

Conference Proceedings – Hardware
Random projections have proven extremely useful in many signal processing and machine learning applications. However, they often require either to store a very large random matrix, or to use a different, structured matrix to reduce the computational and memory costs. Here, we overcome this difficulty by proposing an analog, optical device, that performs the random projections literally at the speed of light without having to store any matrix in memory.