Tim Lawson

Nesterov lookahead and Muon 1

19 July 2025

Momentum-based optimizers are ubiquitous in deep learning, but their implementation details can be subtle. In this post, I'll walk through classical and Nesterov momentum, comparing the PyTorch implementations with the mathematical formulations and highlighting memory-efficient tricks. We'll use this foundation to understand a recent change in the Muon optimizer and why it was accompanied by a dramatic shift in the recommended learning rate.
