# Passing the Torch: Training a Mamba Model for Smooth Handover

research · technical · Apr 10, 2024

**TL;DR** We experiment with the Warmup-Stable-Decay (WSD) learning rate scheduler and a novel positional weighting of the loss for language model pre-training. We find that WSD outperforms the cosine schedule…
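For readers unfamiliar with WSD: it warms the learning rate up to a peak, holds it constant for most of training, and decays it only in a final short phase. A minimal sketch of such a schedule is below; the warmup and decay fractions are illustrative assumptions, not the hyperparameters used in the experiments.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_frac: float = 0.05, decay_frac: float = 0.1) -> float:
    """Warmup-Stable-Decay schedule sketch (fractions are assumed, not the post's)."""
    warmup_steps = int(total_steps * warmup_frac)
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:
        # stable phase: hold the peak learning rate
        return peak_lr
    # final phase: linear decay from peak_lr to 0
    return peak_lr * (total_steps - step) / max(1, total_steps - decay_start)
```

Unlike a cosine schedule, the stable phase means a checkpoint taken mid-training can later be decayed from wherever it stopped, without committing to a total step budget up front.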