TL;DR
"We experiment with the Warmup-Stable-Decay (WSD) learning rate scheduler and a novel positional weighting of the loss for language model pre-training; We find that WSD outperforms the cosine scheduler, and positional weighting results in better top k accuracy. We conduct the experiments using the Mamba architecture, which with its linear complexity achieves substantially higher throughput at inference than transformers. Finally, based on our experiments we train Mambaoutai, a 1.6B parameters model on 300B tokens. The training dataset comprises mainly French, English and code from various programming languages. Over 80 checkpoints of Mambaoutai 1.6B are released for the ML community to explore here, which thanks to the WSD scheduler can be further pre-trained smoothly without any degradation."
1. Introduction
We present our explorations on training language models based on the new Mamba architecture [1], which deviates from the traditional Transformer architecture. Because of its linear complexity and computing efficiency, Mamba has attracted a lot of attention among practitioners and the broader ML community. These explorations yield a training recipe used to effectively train a 1.6B model, Mambaoutai, which we also release openly for the community. We make over 80 checkpoints available, enabling interpretability studies and further training. Besides the main English training corpus, the model's training data includes a significant portion of French data and code from various programming languages (Rust is among them if you were wondering).
We will outline all the components involved in training Mambaoutai, from data preparation and modeling decisions to the training setup itself. We discuss the training throughput achieved with a Fully Sharded Data Parallel (FSDP) codebase and compare it with the Nanotron library [3]. We further explore the impact of different training techniques on Mambaoutai's performance and convergence during pre-training. We focus on comparing three strategies aimed at enhancing pre-training: two learning rate schedulers and positional weighting of the pre-training loss.
First, we examine the effects of different learning rate schedulers on model performance within a fixed training budget. Our comparison includes the traditional warmup with cosine decay and the Warmup-Stable-Decay (WSD) scheduler introduced for MiniCPM [2]. Compared to the cosine scheduler, the WSD scheduler has a clear separation between pre-training stages, enabling the introduction of high-quality and/or instruction data during the final decay phase, and allows seamless continuation of pre-training from a pre-decay checkpoint without the concern of a cold restart.
Finally, we explore how weighting different token positions in the loss function can improve accuracy across top-k metrics by decreasing the loss weight assigned to model predictions for the first tokens of every text sequence.
All the checkpoints are available on Hugging Face: Mambaoutai.
The training code is available on GitHub: Code.
2. Training: Throughput analysis
A central aspect of large language model training is the distribution of the training load over several accelerators. For instance, DBRX-132B was reportedly trained over two weeks on 3072 GPUs [9], and Falcon-180B was trained on up to 4096 GPUs [10], illustrating the crucial importance of efficient distributed computation for such model sizes. While these models are substantially larger than the models trained in our Mamba experiments, smaller models can still benefit significantly from distributed training, as it speeds up training that would otherwise be prohibitively slow given the scale of the training datasets.
There are two general approaches when it comes to distributed training:
3D parallelism [14, 18]: this requires architecture-specific changes to the model design in order to make it compliant with the specifics of 3D parallelism.
PyTorch FSDP [11, 12]: this approach is much simpler and does not require intrusive modifications to the model's forward pass. This is done by sharding the model parameters, gradients and optimizer states across accelerators, wrapping the model into individual units called FSDP units (a minimal wrapping sketch follows this list).
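To make the FSDP approach concrete, here is a minimal wrapping sketch in PyTorch. The `ResidualBlock` class is a stand-in for the real Mamba block, and the exact wrapping policy and options we used may differ; treat this as an illustration, not our training code.

```python
import functools

import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy


class ResidualBlock(nn.Module):
    """Placeholder for the repeated Mamba block (swap in the real block class)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # stands in for the SSM mixer

    def forward(self, x):
        return x + self.mixer(self.norm(x))


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    # Assumes torch.distributed is already initialized (e.g. via torchrun).
    # Each repeated block becomes its own FSDP unit, so parameters, gradients
    # and optimizer state are sharded across ranks without touching the forward.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy, transformer_layer_cls={ResidualBlock}
    )
    mixed_precision = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    return FSDP(model, auto_wrap_policy=wrap_policy, mixed_precision=mixed_precision)
```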
We compare our adapted version of the composer codebase [23] using FSDP with the Nanotron codebase [3], which leverages 3D parallelism, at the ~2B-parameter scale. All training runs are conducted with bfloat16 [19] and mixed precision training [13].
The benchmarking is carried out on one node of 4xA100-64GB GPUs. The context length is fixed at 4096. We optimize both setups and only show the best results for each codebase. The results of the different runs are presented in the table below:
Parallelism Type | Model | Batch Size | Block-wise Activation Recomputation | Throughput | TFLOPS per GPU
---|---|---|---|---|---
FSDP | 1.6B | 5 | Yes | 11000 | 96.25
3D (Nanotron) | 1.6B | 2 | Not supported | 4880 | 51
With our codebase, we achieve a throughput of 96.25 TFLOPS per GPU, which corresponds to an MFU (Model FLOPs Utilization) of roughly 31% against the A100's 312 TFLOPS bfloat16 peak.
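As a quick sanity check on that figure, here is the arithmetic, assuming the A100's 312 TFLOPS dense bfloat16 peak as the denominator (a common convention for MFU):

```python
# Quick sanity check on the utilization figure, assuming the A100's ~312 TFLOPS
# dense bfloat16 peak as the hardware ceiling.
ACHIEVED_TFLOPS_PER_GPU = 96.25
A100_PEAK_BF16_TFLOPS = 312.0

mfu = ACHIEVED_TFLOPS_PER_GPU / A100_PEAK_BF16_TFLOPS
print(f"MFU ~= {mfu:.1%}")  # roughly 31%
```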
Finally, we also find that Nanotron's 3D parallelism achieves worse results than our codebase with PyTorch FSDP for this model size. One reason is the handling of activation recomputation (block-wise recomputation is not supported in Nanotron); other reasons might include increased communication and bubble times in the pipeline parallelism (Nanotron uses a one-forward-one-backward pipeline scheduler), resulting in underutilization of the hardware. 3D parallelism is expected to be better suited to larger models and to achieve better throughput at that scale.
3. Model Architecture and Dataset
3.1 Model Architecture
The model configuration for Mambaoutai 1.6B follows the original implementation but differs in a few architectural details. The first difference is the use of a much wider hidden state relative to the number of layers, as this enables higher throughput with a close to negligible degradation in training loss. Theoretically, the larger hidden state allows the model to compress more information from earlier parts of a context, although it introduces a trade-off of less depth at a constant model size. Another difference is that we do not tie the word embedding matrix with the final projection layer in this work, for the sake of simplicity and ease of integration with PyTorch FSDP. Finally, one of the key aspects of the Mamba architecture is its linear complexity in the context length, which is very useful for handling long contexts; we thus extended the training context length accordingly.
3.2 Data Preparation
We wanted Mambaoutai to be competent in French, English and coding. We thus created a pre-training dataset reflecting these domains, aligned with state-of-the-art large language model practices and incorporating a mix of widely used datasets such as RedPajama-Data-V2, subsets of RedPajama-Data-1T and The Stack. More details about the different datasets and the custom tokenizer training are given below.
3.2.1 Datasets
To train our Mamba, we have used only open-source datasets available on Hugging Face to encourage reproducibility and transparency, following recent releases [25]. These datasets are already filtered and deduplicated, which saves us a lot of time and resources in preprocessing.
togethercomputer/RedPajama-Data-V2
This dataset, which includes over 100B text documents from 84 CommonCrawl snapshots, was our primary source of data during the stable pre-training phase. All available CommonCrawl snapshots contain 5000 shards per language. We used the first 30 shards of each snapshot (head_middle partition) to gather 78B tokens in English, and the first 300 shards to gather around 97B tokens in French.
Using the precomputed metrics of the RedPajama-Data-V2 dataset, we removed the duplicated samples and filtered the dataset with the Gopher [6] and C4 [7] filters.
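As an illustration, a few of the Gopher-style rules can be expressed as simple predicates over a document's precomputed signals. The field names below are hypothetical; the actual RedPajama-Data-V2 quality-signal keys differ and would need to be mapped accordingly.

```python
# Illustrative subset of the Gopher quality rules [6] applied to precomputed signals.
def passes_gopher_rules(doc: dict) -> bool:
    if not (50 <= doc["word_count"] <= 100_000):         # document length bounds
        return False
    if not (3 <= doc["mean_word_length"] <= 10):         # plausible word lengths
        return False
    if doc["symbol_to_word_ratio"] > 0.1:                # '#' and '...' density
        return False
    if doc["frac_lines_ending_with_ellipsis"] > 0.3:     # truncated/boilerplate lines
        return False
    if doc["frac_words_with_alpha_char"] < 0.8:          # mostly non-alphabetic junk
        return False
    return True
```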
togethercomputer/RedPajama-Data-1T
We used only the book, arxiv, and wikipedia (English and French) subsets of this dataset.
The other subsets (C4, CommonCrawl and GitHub) were not used because of possible duplication with the other training datasets (RedPajama-Data-V2 for CommonCrawl and The Stack for code).
bigcode/the-stack-dedup
Since we wanted our model to have reasonable coding capabilities, we decided to use a non-negligible amount of code extracted from this near-deduplicated version of bigcode/the-stack.
We only used the most popular programming languages on GitHub: javascript, python, cpp, c, java, go, rust, and php.
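A loading sketch for these per-language subsets is shown below; the exact directory names (e.g. "c++" vs "cpp") should be checked against the dataset card, and access to the gated dataset must be granted on Hugging Face first.

```python
from datasets import load_dataset

LANGUAGES = ["javascript", "python", "cpp", "c", "java", "go", "rust", "php"]

# Each language lives in its own directory under data/ in the-stack-dedup.
code_subsets = {
    lang: load_dataset(
        "bigcode/the-stack-dedup",
        data_dir=f"data/{lang}",
        split="train",
        streaming=True,  # avoids downloading terabytes up front
    )
    for lang in LANGUAGES
}
```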
3.2.2 Dataset distribution
As previously introduced, the WSD scheduler separates training into a stable phase, during which the model trains at a steady, high learning rate for fast progress, and a decay phase, during which the model converges and thus benefits from higher-quality data. Accordingly, we use a different data mixture for each of these phases.
Our stable pre-training dataset distribution is as follows:
Data distribution during the stable phase of the training
We then add instruction data to our decay phase, resulting in the following dataset distribution:
Data distribution during the decay phase of the training
For the decay phase, we added one epoch of the UltraChat dataset [24] and scaled down the other datasets accordingly.
3.2.3 Tokenizer
We decided to train our tokenizer from scratch with the Hugging Face tokenizers library to improve coverage of our training datasets and to have a larger vocabulary size. Larger vocabulary sizes compress the sequence length further, meaning more information can be packed into the context, which is critical for language models.
We trained a Byte-Pair Encoding (BPE) tokenizer with a vocabulary size of 64k tokens and added a set of special "ChatML" tokens used during instruction fine-tuning to simulate a conversation between the user and the assistant. The tokenizer was trained on 15GB of data, with a language distribution that approximately matches the training distribution of the Mamba model (10% code, 30% French and the remaining 60% English).
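A minimal sketch of this setup with the tokenizers library is shown below; the file paths and the exact special tokens are placeholders, not our actual training configuration.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE with a 64k vocabulary and ChatML-style special tokens.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=64_000,
    special_tokens=["<|im_start|>", "<|im_end|>", "<|endoftext|>"],  # placeholders
)

# ~15GB of raw text, mixed to roughly 60% English / 30% French / 10% code.
files = ["data/en.txt", "data/fr.txt", "data/code.txt"]  # placeholder paths
tokenizer.train(files, trainer)
tokenizer.save("mambaoutai-tokenizer.json")
```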
4. Experiments and Ablations
In this section, we detail the experiments conducted to validate different technical choices for the training procedure: the choice of learning rate scheduler and the effect of positional weighting of the loss on performance. For computational reasons, we used a smaller 0.5B-parameter model and a maximum of 10B tokens.
We adhere to the architecture and hyperparameters used in the original Mamba [1] implementation. The peak learning rate is chosen based on a small hyperparameter search over 1B tokens.
4.1 Learning rate schedulers
4.1.1 Cosine scheduler
The cosine scheduler makes the learning rate follow a cosine-shaped function, starting with a high learning rate at the beginning of training and gradually decreasing it to a minimum value over time. This approach allows for faster initial progress while ensuring finer steps towards the end of training. Compared to other schedulers, such as step decay or exponential decay, the cosine scheduler has been shown to provide better performance on various tasks, including image classification, language modeling and neural machine translation, and has become the default scheduler, at least in NLP [10].
Our cosine scheduler starts with a linear warmup for 480 steps and then decays the learning rate following a cosine curve down to its minimum value.
The cosine scheduler with a linear warmup phase
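For reference, here is a minimal sketch of a cosine-with-warmup factor for PyTorch's LambdaLR. The 480 warmup steps match the setup above, while the total step count and minimum-LR ratio are placeholder assumptions, since the exact values are not restated here.

```python
import math

def cosine_with_warmup(step: int, warmup: int = 480, total: int = 20_000,
                       min_ratio: float = 0.1) -> float:
    # Multiplicative factor on the peak learning rate.
    if step < warmup:
        return step / max(1, warmup)                        # linear warmup
    progress = (step - warmup) / max(1, total - warmup)     # 0 -> 1 over the decay
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_ratio + (1.0 - min_ratio) * cosine

# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=cosine_with_warmup)
```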
A critical limitation of cosine learning rate decay is that it achieves optimal performance only when an entire cosine period is completed [4], forcing practitioners to fix the number of steps beforehand. This poses a significant hurdle if we want to continue pre-training later, when more data and/or compute becomes available.
4.1.2 WSD scheduler
The Warmup-Stable-Decay (WSD) [2] scheduler solves this issue and makes it possible to continue pre-training without a predetermined number of steps. This is particularly handy because the number of training steps might not be known a priori, especially if the model is further trained after being openly released. The WSD strategy is composed of three stages: a warmup phase, a stable learning rate phase, and a decay phase. The warmup stage gradually increases the learning rate from a minimum to a maximum value over a specified number of steps or amount of training. The stable stage keeps the learning rate constant at the peak value, resulting in faster training overall. In the decay stage, the learning rate decreases according to a predefined function, such as exponential or linear decay; this phase ensures convergence while avoiding overshooting near convergence points. Additionally, high-quality and/or instruction data can be added to the pre-training data mix at the start of the decay phase, enabling a more efficient way of integrating resource-constrained data during pre-training.

What is even more interesting is that we can continue pre-training from any checkpoint of the stable stage without worrying about very low learning rates, as is the case with the cosine scheduler, or about warm-restart degradation [17, 16]. Contrary to checkpoints obtained with the cosine scheduler, any checkpoint prior to the decay phase can be used as a starting point to continue pre-training. ML practitioners can thus easily take a model pre-trained for 300B tokens and continue pre-training it up to 3T tokens, for example. This has already been advocated in the computer vision domain [26], and we think it should become the standard in NLP. We envision this as a game changer for the open model ecosystem, as entities can iteratively continue pre-training open models and collectively reach volumes of training data that were previously out of reach for smaller organizations. In this mindset, we release Mambaoutai both pre- and post-decay, for open science as well as in the hope of kickstarting a virtuous collective pre-training effort.
The WSD scheduler we used starts with a linear warmup for 480 steps, similar to the cosine schedule, followed by a constant learning rate at its peak value throughout the stable phase, before the final decay.
The WSD scheduler
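Below is a matching sketch of the WSD schedule as a LambdaLR factor. The warmup length follows the setup above, while the stable/decay boundaries, total step count and the linear decay shape are placeholder assumptions.

```python
def wsd(step: int, warmup: int = 480, stable_until: int = 18_000,
        total: int = 20_000, min_ratio: float = 0.0) -> float:
    # Multiplicative factor on the peak learning rate.
    if step < warmup:
        return step / max(1, warmup)          # warmup: linear ramp to the peak
    if step < stable_until:
        return 1.0                            # stable: constant peak learning rate
    progress = (step - stable_until) / max(1, total - stable_until)
    return max(min_ratio, 1.0 - progress)     # decay: linear ramp down (one option)

# scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=wsd)
```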
4.2 Loss Positional Weighting
The early tokens of a sequence are harder for the model to predict, as they are conditioned on fewer tokens. As a consequence, the cross-entropy loss is substantially higher for the initial tokens. We hypothesize that the loss on early tokens thus contributes to a noisier learning signal. Furthermore, model generations are rarely conditioned on just a few tokens in practice. Hence, we propose to apply a positional weighting of the loss, i.e., multiplying each token's loss term by a weight based on its position. We therefore experiment with down-weighting the loss on early tokens during training, using a scaled hyperbolic tangent (tanh) of the token position as a starting point for the weighting function.
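The exact constants of our weighting function are not reproduced here, but the sketch below shows one plausible instantiation of a scaled-tanh positional weighting applied to the token-level cross-entropy loss; the scale value is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def position_weighted_loss(logits: torch.Tensor, targets: torch.Tensor,
                           scale: float = 16.0) -> torch.Tensor:
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    b, t, v = logits.shape
    per_token = F.cross_entropy(
        logits.reshape(-1, v), targets.reshape(-1), reduction="none"
    ).reshape(b, t)
    positions = torch.arange(t, device=logits.device, dtype=per_token.dtype)
    weights = torch.tanh(positions / scale)   # ~0 at the first position, saturates to 1
    return (per_token * weights).sum() / (weights.sum() * b).clamp(min=1e-8)
```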
The figure below shows that, in practice, the applied weight is close to one except for the first few tokens of each sequence.
The scaled hyperbolic tangent function (tanh) used to scale the loss
In summary, we have conducted three experiments:
- We train a 0.5B Mamba with a WSD scheduler for 10B tokens.
- We train a 0.5B Mamba with a Cosine scheduler for 10B tokens.
- We train a 0.5B Mamba with a WSD scheduler and positional weighting for 10B tokens.
4.3 Insights
The results of each setup are given in the following figure:
Training loss curves for all three experiments
We observe that the training losses of the different setups were similar throughout training, except for some spikes in the runs using the WSD scheduler. Despite the spikes, the WSD runs still reached a lower training loss than the cosine run. Spikes in loss can occur for various reasons, such as dirty data or model scale, and may also happen randomly during optimization when passing through a bad optimization point. As the spikes only occur during training with WSD, and not with the cosine schedule for the same model size and the exact same data, we can rule out bad data and model size as the main triggers. Instead, the spikes appear to be caused by maintaining a peak learning rate for an extended training period, which indicates that stable training with WSD would require a lower learning rate than a cosine run. In any case, even though WSD might be less stable, it is still a better choice than the cosine scheduler.
Focusing on the decay phase gives a clearer view of the training dynamics during this stage, as illustrated in the following figure. The WSD scheduler exhibits a completely different training dynamic compared to the cosine scheduler: while it lags behind the latter during the whole stable stage, it quickly catches up and surpasses it during the decay phase. This illustrates how critical the decay phase is for the model and how carefully it should be designed. It also means that sharing pre-decay checkpoints allows practitioners not only to continue pre-training but also to perform their own decay phase suited to their needs.
Training loss curves during the decay phase for all three experiments
On the validation set, the WSD schedule achieved the best perplexity of the three runs.
Validation perplexity during the decay phase for all three experiments
However, perplexity alone may not provide a comprehensive evaluation of the end model. Therefore, we also include top-k accuracy metrics (top-3 and top-10) on the validation set.
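For reference, a minimal sketch of how such top-k accuracy can be computed from the model's logits (not the exact evaluation code we used):

```python
import torch

def top_k_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int) -> float:
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    topk = logits.topk(k, dim=-1).indices               # (batch, seq_len, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)  # target among the k best?
    return hits.float().mean().item()
```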
In the figure below, we see that the run using WSD and positional weighting again achieves better validation results in terms of top-3 and top-10 accuracy, while the runs without positional weighting (both WSD and cosine) are almost identical in top-10 accuracy. The top-3 accuracy of the WSD run with positional weighting is the highest of the three.
Top-k accuracy metrics (k = 3 and k = 10)
While positional weighting of the loss resulted in slightly improved top-k validation metrics, it did not improve the perplexity. This is expected, as it trades prediction quality on earlier tokens for later ones, and perplexity is dominated by the earlier tokens. Top-k accuracy, on the other hand, implicitly normalizes the contribution of each position and is thus not as dominated by model performance on early positions. The results support our hypothesis that the first tokens are not modeled with the same strategy as the later ones: for early tokens the model needs to model their unconditional distribution, as it is basically a guessing game, while predictions on later tokens can better exploit language causality from previous tokens. Moreover, the loss being dominated by the first few tokens indicates that they dominate the gradient as well. As the first few tokens are very rarely inferred in prompt-based use cases, down-weighting them reduces the task's complexity, mitigates this split objective, and reduces gradient noise. Overall, combining these techniques can lead to improved model performance and generalization capabilities.
5. Experimental Results and Benchmarks
In the following section, we present and analyze the performance of our model Mambaoutai 1.6B, trained based on our observations in the previous section, using the WSD scheduler and positional weighting.
We train Mambaoutai 1.6B following the previous recipe while scaling up the hidden state and the number of layers. We select the peak learning rate using the same strategy as in the experiments above.
During the training of Mambaoutai, we also encountered some loss spikes. Whenever a spike occurred, we restarted training from a previous checkpoint and skipped the data causing the spike. We also had to restart training once due to hardware issues. These different chronicles of the training are visible on the training loss curve below, which shows the different restarts.
The training loss for Mambaoutai with the different restarts
5.1 Long-range Performance
One of the main strengths of the Mamba architecture is its ability to process long inputs more efficiently. Given that we specifically train Mambaoutai to leverage this property (larger context/hidden state), we evaluate the resulting capacity using the Needle-in-a-Haystack evaluation. The Needle-in-a-Haystack test, as described in [5], is a method used to evaluate the retrieval performance of language models over various context window sizes: a specific needle (here, a random number) is inserted at a given depth within a long distractor text (the haystack), and the model is asked to retrieve it.
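To make the setup concrete, here is a minimal sketch of how such a probe can be constructed; the prompt wording, filler text and needle format are illustrative assumptions, not the exact evaluation prompts.

```python
import random

def build_haystack(context_len_chars: int, depth: float, filler: str) -> tuple[str, str]:
    needle_value = str(random.randint(10_000, 99_999))
    needle = f"\nThe magic number is {needle_value}.\n"
    haystack = (filler * (context_len_chars // len(filler) + 1))[:context_len_chars]
    insert_at = int(depth * len(haystack))   # 0.0 = start of context, 1.0 = end
    prompt = (
        haystack[:insert_at] + needle + haystack[insert_at:]
        + "\nWhat is the magic number mentioned above? Answer:"
    )
    return prompt, needle_value

# Example: needle buried 30% of the way into a ~16k-character haystack.
prompt, answer = build_haystack(16_000, 0.3, filler="The grass is green. ")
```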
We note in particular that Mamba not requiring positional embeddings facilitates context length extension without bells and whistles, unlike transformer-based models where extending the context window requires sophisticated methods such as YaRN [20], StreamingLLM [21], NTK-YaRN [22] or LongLlama.
The figure presented below showcases the performance of Mambaoutai in a Needle-in-a-Haystack evaluation. It demonstrates that the model is nearly flawless at identifying the needle within the pre-training context length.
Needle in a haystack evaluation for Mambaoutai on a 16k context length
To draw a comparison between Mambaoutai and existing transformer models of the same size, we chose the Stable LM 2 1.6B [8] model released by Stability AI, which was trained with the same context length.
Needle in a haystack evaluation for Stable LM 2 1.6B on a 16k context length
Whether larger Mamba models can surpass the performance of transformer-based architectures remains an open question. Expanding the size of a Mamba model's hidden state, and consequently its model size, would likely allow it to retain more context information. However, at a constant model size, there is a compromise between depth and width, resulting in a possible trade-off between processing capabilities and long-context handling. Moreover, the lack of positional embeddings in Mamba models means they can naturally handle larger contexts without worrying about the pre-training context length.
5.2 Academic Benchmarks
Generally, benchmarks ought to be approached with caution due to the risk of contamination or manipulation, especially when models are trained on trillions of tokens of data. With this in mind, we provide indicative results to give an idea of model performance on popular benchmarks. In line with our training dataset mixture, whose two biggest subsets consist of French and English data, we evaluated our model on two sets of benchmarks: FrenchBench and the Open LLM Leaderboard benchmarks (English). To have a reference and a baseline of comparison for our model, we chose two similarly sized models: Stable LM 2 1.6B and Falcon-RW-1B (1.3B). Before diving into the results, it is important to keep in mind the differences in the number of tokens used to train each model, as model performance is highly correlated with the amount of training data. Here is a detailed view of the data used to train the models being compared:
Model | French Data | English Data | Total
---|---|---|---
Mambaoutai 1.6B | 100B | 150B | 300B
Stable LM 2 1.6B | 20B | 1675B | 2000B
Falcon-RW-1B | 0 | 350B | 350B
First, we note that Mambaoutai 1.6B is the model trained on the least total number of tokens, followed by Falcon-RW-1B, and then, by a large margin, Stable LM 2 1.6B with 2T tokens. Second, Falcon-RW-1B was trained mainly on English, and Stable LM 2 1.6B used only 20B tokens of French, while Mambaoutai was trained on 100B tokens of French. Finally, Stable LM 2 1.6B was trained on around 1675B tokens of English, followed by Falcon-RW-1B with 350B tokens and Mambaoutai 1.6B with 150B tokens. The Open LLM Leaderboard benchmarks include common sense reasoning tasks such as HellaSwag (10-shot), WinoGrande (5-shot) and ARC-Challenge (25-shot), as well as GSM8K (5-shot), TruthfulQA (zero-shot) and MMLU (5-shot).
As expected given the outlined differences in pre-training data, Stable LM 2 1.6B typically performs better than both Falcon-RW-1B and Mambaoutai 1.6B on the English-centric benchmarks, with the exception of TruthfulQA, where Falcon-RW-1B shows superior performance compared to the other two models. For the remaining benchmarks, Mambaoutai 1.6B outperforms Falcon-RW-1B in three cases and matches its performance in two others. For GSM8K, we note that Stable LM 2 1.6B is the only model with dedicated math datasets in its pre-training data, whereas Mambaoutai and Falcon-RW-1B have no such data and get an accuracy close to zero.
Open LLM Leaderboard Benchmarks
Finally, on French, we can see that Mambaoutai outperforms both Stable LM 2 1.6B and Falcon-RW-1B on tasks such as grammar, vocabulary and topic-based natural language inference, while it still lags behind on the reasoning-heavy tasks that have been translated from English using GPT-3.5. We attribute these results to the training data discrepancies, as Mambaoutai 1.6B is trained on more French tokens than both Stable LM 2 1.6B and Falcon-RW-1B.
French Bench for French language evaluations
As a conclusion, we note that the model underperforms its comparably sized transformer counterpart Stable LM 2 1.6B [8] on standard NLP benchmarks in English, but is on par with it on French benchmarks. The difference in performance on English benchmarks can to a large extent be attributed to the fact that Stable LM 2 1.6B has been trained on more than 6x the number of tokens Mambaoutai has seen, and that Mambaoutai additionally has a substantially lower percentage of English in its training data, since we wanted it to be capable in French and code as well. We also acknowledge that the decision to increase the model width, to achieve higher throughput and a hidden state that can compress more information from long contexts, may have come at the cost of reasoning capabilities, as the model depth had to be decreased. We encourage further research on this topic for Mamba-like architectures.
6. Conclusion
In this blog post, we have presented the details of pre-training a relatively large language model based on the newly introduced Mamba architecture. We have shared the results of our explorations to give some insights to practitioners looking to train a Mamba model; additionally, the insights on the WSD scheduler and positional weighting should transfer to other architectures such as transformers, as both techniques are architecture-agnostic for autoregressive language modeling.
Firstly, we have validated that the WSD scheduler is a promising alternative to the cosine scheduler, as it gives better performance on validation metrics while additionally enabling smooth continued pre-training and a custom, adaptable decay phase. We also found that positional weighting of the language modeling loss improves the performance of the resulting model in terms of top-k accuracy metrics. Another byproduct of using a state space model is native support for large context lengths, which allowed Mambaoutai to outperform highly optimized transformer models of the same size on the needle-in-a-haystack task. Finally, we are releasing checkpoints at various steps of the training, so that others can evaluate the final performance, continue the pre-training, or perform a decay phase better suited to their downstream applications.
Future work:
- Investigate positional weighting further: experiment with more advanced weighting functions, evaluate the impact on downstream performance.
- Test retrieval and long context length beyond the needle-in-a-haystack task. Fine-tuning the model on larger contexts is a viable choice if we want to use the model as a document scanner before feeding the retrieved results to a bigger model.
- Train a bigger pure Mamba to validate its effectiveness against the Transformer once and for all.
Limitations:
- Mambaoutai is trained only on French, English and Code. Other languages are not supported.
- Constrained by the model size, it may exhibit hallucinations and repetitions, particularly with longer and more elaborate prompts.
- Constrained by the model capacity, the accuracy of the model's knowledge recall is limited.
- The evaluation of the WSD scheduler and positional weighting is still very limited.
To cite this work, please refer to the following bibtex:
@misc{mambaoutai,
  title={Passing the Torch: Training a Mamba Model for Smooth Handover},
  author={Hallström, Oskar and Taghadouini, Said and Thiriet, Clément and Chaffin, Antoine},
  url={https://www.lighton.ai/blog/lighton-s-blog-4/passing-the-torch-training-a-mamba-model-for-smooth-handover-54},
  year={2024}
}
References
[1] Albert Gu, Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752 (2023)
[2] MiniCPM: Unveiling the Potential of End-side Large Language Models. https://shengdinghu.notion.site/MiniCPM-Unveiling-the-Potential-of-End-side-Large-Language-Models-d4d3a8c426424654a4e80e42a711cb20
[3] Nanotron. https://github.com/huggingface/nanotron . (as of 28/03/2024)
[4] Jordan Hoffmann et al. Training Compute-Optimal Large Language Models. https://arxiv.org/pdf/2203.15556.pdf
[5] Amirkeivan Mohtashami, Martin Jaggi. Landmark Attention: Random-Access Infinite Context Length for Transformers. https://arxiv.org/abs/2305.16300
[6] Jack W. Rae et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. https://arxiv.org/abs/2112.11446
[7] Colin Raffel et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. https://arxiv.org/abs/1910.10683
[8] Marco Bellagente et al. Stable LM 2 1.6B Technical Report. https://arxiv.org/abs/2402.17834
[9] https://www.wired.com/story/dbrx-inside-the-creation-of-the-worlds-most-powerful-open-source-ai-model/
[10] Almazrouei, Ebtesam, et al. The falcon series of open language models. arXiv preprint arXiv:2311.16867 (2023).
[11] Xu et al. Automatic cross-replica sharding of weight update in data-parallel training. arXiv preprint arXiv:2004.13336 (2020).
[12] Rajbhandari, Samyam, et al. "ZeRO: Memory optimizations toward training trillion parameter models." SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2020.
[13] Micikevicius, Paulius, et al. "Mixed precision training." arXiv preprint arXiv:1710.03740 (2017).
[14] Shoeybi, Mohammad, et al. "Megatron-LM: Training multi-billion parameter language models using model parallelism." arXiv preprint arXiv:1909.08053 (2019).
[15] Chen, Tianqi, et al. "Training deep nets with sublinear memory cost." arXiv preprint arXiv:1604.06174 (2016).
[16] Ibrahim, Adam, et al. "Simple and Scalable Strategies to Continually Pre-train Large Language Models." arXiv preprint arXiv:2403.08763 (2024).
[17] Ash, Jordan T., and Ryan P. Adams. "On the difficulty of warm-starting neural network training." (2019).
[18] DeepSpeed: Extreme-scale model training for everyone. https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/ (2020).
[19] BFloat16: The secret to high performance on Cloud TPUs. https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus (2019).
[20] YaRN: Efficient Context Window Extension of Large Language Models. https://arxiv.org/abs/2309.00071 (2023).
[21] Efficient Streaming Language Models with Attention Sinks. https://arxiv.org/abs/2309.17453 (2023).
[22] Hallström, Oskar et al. "Alfred-40B-1023." https://huggingface.co/lightonai/alfred-40b-1023 (2023)
[23] The Mosaic ML Team. "Composer" (2021)
[24] Ning Ding et al. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (2023) https://arxiv.org/abs/2305.14233
[25] Open Language Model: OLMo (2024) https://allenai.org/olmo
[26] Zhai et al. Scaling Vision Transformers (2021) https://arxiv.org/abs/2106.04560