Nesterov lookahead and Muon
Momentum-based optimizers are ubiquitous in deep learning, but their implementation details can be subtle. In this post, I'll walk through classical and Nesterov momentum, comparing the PyTorch implementations with the mathematical formulations and highlighting memory-efficient tricks along the way. We'll then use this foundation to understand a recent change in the Muon optimizer and why it was accompanied by a dramatic shift in the recommended learning rate.