
LightOn Publications


What Language Model to Train if You Have One Million GPU Hours?

Authors: Daniel Hesslow, Teven Le Scao, Lucile Saulnier, Thomas Wang, M Saiful Bari, Stas Bekman, Stella Biderman, Hady Elsahar, Jason Phang, Ofir Press, Colin Raffel, Victor Sanh, Sheng Shen, Lintang Sutawika, Jaesung Tae, Zheng Xin Yong, Julien Launay, Iz Beltagy

ACL 2022 Workshop BigScience – LLMs – April 2022
As the size of language models grows, they become more powerful and deliver better results, but they also become more expensive to design and train. Given a compute budget sufficient to train a multilingual transformer language model at the 100B+ parameter scale, our goal is to choose the architecture and training setup of such a model. Specifically, we perform an ablation study comparing different architectural choices, which can significantly impact the performance of the resulting models. We focus on the 1.3B parameter scale, a compromise between the compute cost of the architecture search and the probability that our conclusions hold for the target 100B+ model. In addition, we study the impact of various popular pretraining corpora on model quality. We also study the performance of a multilingual model and how it compares to an English-only one. Finally, we consider the scaling behaviour of transformer models to choose the target model size, its shape, and its training setup.
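
As a rough illustration of what the budget in the title represents (the GPU model, utilization figure, and the common C ≈ 6·N·D FLOP approximation below are our assumptions, not figures from the paper):

```python
# Back-of-the-envelope: what does one million GPU hours buy?
# Assumptions: A100 GPUs at 312 TFLOP/s peak, 30% sustained utilization,
# and C ≈ 6·N·D training FLOPs for a dense transformer.
gpu_hours = 1_000_000
peak_flops = 312e12          # A100 bf16 peak, FLOP/s
utilization = 0.30           # fraction of peak actually sustained

total_flops = gpu_hours * 3600 * peak_flops * utilization

n_params = 100e9             # target scale from the abstract: 100B+ parameters
tokens = total_flops / (6 * n_params)

print(f"budget ≈ {total_flops:.2e} FLOP, ≈ {tokens / 1e9:.0f}B tokens at 100B params")
```

Under these assumptions the budget corresponds to a few hundred billion training tokens at the 100B-parameter scale, which is the regime the abstract describes.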

PAGnol: An Extra-Large French Generative Model

Authors: Julien Launay, Elena Tommasone, Baptiste Pannier, François Boniface, Amélie Chatelain, Alessandro Cappelli, Iacopo Poli, Djamé Seddah

LREC 2022 – LLMs – Initially published: October 2021
Access to large pre-trained models of varied architectures, in many different languages, is central to the democratization of NLP. We introduce PAGnol, a collection of French GPT models. Using scaling laws, we efficiently train PAGnol-XL (1.5B parameters) with the same computational budget as CamemBERT, a model 13 times smaller. PAGnol-XL is the largest model trained to date for the French language. We plan to train increasingly large and better-performing versions of PAGnol, exploring the capabilities of French extreme-scale models.
For this first release, we focus on the pre-training and scaling calculations underlying PAGnol. We fit a scaling law for compute for the French language and compare it with its English counterpart. We find that the pre-training dataset significantly conditions the quality of the outputs, with common datasets such as OSCAR leading to low-quality offensive text. We evaluate our models on discriminative and generative tasks in French, comparing them to other state-of-the-art French and multilingual models, and reach the state of the art on the abstract summarization task. Our research was conducted on the public GENCI Jean Zay supercomputer, and our models up to the Large size are made publicly available.
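
A scaling law of the kind mentioned here can be fitted with an ordinary log-log regression; the sketch below uses made-up loss/compute pairs, not the paper's French-language measurements:

```python
import numpy as np

# Hypothetical (loss, compute) measurements in PF-days; these numbers
# are illustrative only, not the paper's data.
compute = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
loss = np.array([3.90, 3.62, 3.36, 3.12, 2.90, 2.69])

# Fit a power law L(C) = a * C^(-b) by linear regression in log-log space:
# log L = log a - b * log C.
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a, b = np.exp(log_a), -slope

print(f"L(C) ≈ {a:.2f} * C^(-{b:.3f})")
```

Comparing the fitted exponent `b` across languages (French vs. English) is exactly the kind of comparison the abstract describes.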

Scaling Laws Beyond Backpropagation

Authors: Matthew J. Filipovich, Alessandro Cappelli, Daniel Hesslow, Julien Launay

NeurIPS 2022 – Workshop: I Can’t Believe It’s Not Better, December 2022
Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment (DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is no regime in which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding is at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
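
The mechanism under study can be sketched in a few lines of numpy (a toy two-layer network, not the paper's decoder-only Transformers; all sizes and learning rates are illustrative assumptions): DFA feeds the output error back through a fixed random matrix instead of the transposed forward weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: x in R^8 -> y in R^2.
X = rng.standard_normal((256, 8))
Y = X @ rng.standard_normal((8, 2))

W1 = 0.1 * rng.standard_normal((8, 32))   # input -> hidden
W2 = 0.1 * rng.standard_normal((32, 2))   # hidden -> output
B = rng.standard_normal((2, 32))          # fixed random feedback matrix

lr_w1, lr_w2 = 0.02, 0.5
losses = []
for _ in range(500):
    H = np.tanh(X @ W1)
    E = H @ W2 - Y                        # output error
    losses.append(float((E ** 2).mean()))
    # Backprop would send the error back through W2.T; DFA instead
    # projects it through the fixed random matrix B.
    dH = (E @ B) * (1.0 - H ** 2)         # tanh derivative
    W2 -= lr_w2 * H.T @ E / len(X)
    W1 -= lr_w1 * X.T @ dH / len(X)
```

Because `B` never changes, each layer's update depends only on the global error and local activations, which is what makes DFA attractive for parallel or local learning; the paper's point is that this convenience does not translate into better scaling.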

Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances

Authors: Ruben Ohana, Kimia Nadjahi, Alain Rakotomamonjy, Liva Ralaivola

Technical Reports and Preprints – Machine Learning
The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and the central observation that SW actually hinges on a slice-distribution-dependent Gibbs risk, the kind of quantity PAC-Bayesian bounds have been designed to characterize. We provide four types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e. distances defined with respect to any distribution of slices, ii) a procedure to learn the distribution of slices that yields a maximally discriminative SW, by optimizing our PAC-Bayesian bounds, iii) an insight on how the performance of the so-called distributional Sliced-Wasserstein distance may be explained through our theory, and iv) empirical illustrations of our findings.
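
For reference, a Monte Carlo estimate of the standard (uniform-slice) SW distance between two equal-size point clouds can be sketched as follows; the adaptive variants studied in the paper replace the uniform directions below with a learned slice distribution:

```python
import numpy as np

def sliced_wasserstein(X, Y, n_slices=100, seed=0):
    """Monte Carlo estimate of SW_2 between two empirical distributions
    given as (n, d) sample arrays with the same number of points."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Uniform slice distribution: random directions on the unit sphere.
    theta = rng.standard_normal((n_slices, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # The 1D W_2 between projected samples reduces to comparing sorted
    # projections; average the squared cost over slices.
    px = np.sort(X @ theta.T, axis=0)
    py = np.sort(Y @ theta.T, axis=0)
    return np.sqrt(((px - py) ** 2).mean())
```

The per-slice sort is what makes SW cheap (O(n log n) per direction) compared with solving a full optimal transport problem.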

A high-fidelity and large-scale reconfigurable photonic processor for NISQ applications

Authors: A. Cavaillès, P. Boucher, L. Daudet, I. Carron, S. Gigan, K. Müller

Technical Reports and Preprints – Machine Learning
Reconfigurable linear optical networks are a key component for the development of optical quantum information processing platforms in the NISQ era and beyond. We report the implementation of such a device based on an innovative design that uses the mode mixing of a multimode fiber in combination with the programmable wavefront shaping of a spatial light modulator (SLM). The capabilities of the platform are explored in the classical regime. For a record 8 inputs and 38 outputs, we achieve fidelities in excess of 93%, week-long stability, and losses below 6.5 dB. The device was built inside a standard server rack to allow for real-world use.

Binarization for Optical Processing Units via REINFORCE

Authors: B. Kozyrskiy, I. Poli, R. Ohana, L. Daudet, I. Carron, M. Filippone

Conference proceedings – Machine Learning – November 2021
Optical Processing Units (OPUs) are computing devices that perform random projections of input vectors by exploiting the physical phenomenon of scattering a light source through an opaque medium. OPUs have successfully been used to carry out approximate kernel ridge regression at scale and with low power consumption by means of optical random features. OPUs require input vectors to be binary, and this work proposes a novel way to perform supervised data binarization. The main difficulty is that the OPU projection matrices are unknown, which makes it challenging to derive a binarization approach in an end-to-end fashion. Our approach is based on the REINFORCE gradient estimator, which allows us to estimate the gradient of the loss function with respect to the binarization parameters by treating the OPU as a black box. Through experiments on several UCI classification and regression problems, we show that our method outperforms alternative unsupervised and supervised binarization techniques.
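
The black-box trick at the heart of the method can be illustrated on a single Bernoulli bit (the "OPU" below is a hypothetical stand-in function, not the real device or the paper's setup): REINFORCE estimates the gradient of an expected loss through sampling alone, and in this one-bit case the estimate can be checked against the analytic gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def opu_like_blackbox(b):
    """Stand-in for the OPU pipeline: we can evaluate it on binary
    inputs but cannot differentiate through it."""
    return 3.0 * b + 1.0

logit = 0.4                       # binarization parameter to learn
p = 1.0 / (1.0 + np.exp(-logit))  # P(bit = 1) under the sampling policy

# REINFORCE: d/dlogit E[f(b)] = E[f(b) * dlog P(b)/dlogit],
# and for a Bernoulli bit dlog P(b)/dlogit = b - p.
b = (rng.random(200_000) < p).astype(float)
f = opu_like_blackbox(b)
grad_estimate = np.mean(f * (b - p))

# Analytic gradient for this one-bit case: p(1-p) * (f(1) - f(0)).
grad_exact = p * (1 - p) * (opu_like_blackbox(1.0) - opu_like_blackbox(0.0))
```

The same estimator extends to vectors of bits by summing the per-bit score terms, which is how the method can optimize binarization parameters while the projection matrix stays hidden inside the hardware.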
