# Passing the Torch: Training a Mamba Model for Smooth Handover

research · technical · Apr 10, 2024

**TL;DR** We experiment with the Warmup-Stable-Decay (WSD) learning rate scheduler and a novel positional weighting of the loss for language model pre-training. We find that WSD outperforms the cosine schedule…
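For readers unfamiliar with WSD: it warms the learning rate up to a peak, holds it constant for most of training, and decays it only in a final short phase. A minimal sketch of such a schedule is below; the warmup and decay fractions are illustrative assumptions, not the hyperparameters used in the experiments.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay schedule sketch (fractions are assumed, not the post's)."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        # stable phase: hold the peak learning rate
        return peak_lr
    # final phase: linear decay from peak_lr to 0
    return peak_lr * (total_steps - step) / max(1, total_steps - decay_start)
```

Unlike a cosine schedule, the stable phase means a checkpoint taken mid-training can later be decayed from wherever it stopped, without committing to a total step budget up front.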