Muon, muP, and the Compute-Time Tradeoff
Essential AI • Optimization
May 12, 2025
Adam and AdamW are vital ingredients in the pretraining recipe, dominating the landscape of neural network optimizers. Recently, Muon, a surprisingly simple second-order optimizer, has emerged as a potential alternative. In our [paper], we ask two questions:
Is Muon a robust replacement for AdamW?
Does Muon work well with muP, the maximal update parameterization?
We demonstrate that Muon [Bernstein, Keller, Moonshot] achieves better compute-time tradeoffs than AdamW, especially at large batch sizes. It readily pairs with maximal update parameterization (muP), a lightweight hyperparameter tuning strategy, delivering easy efficiency wins for pretraining LLMs.
Why This Matters
Ultimately, what matters most for a pretraining workload is how much time and compute it takes to reach your target loss with the resources at your disposal. All else being equal, you would prefer an optimizer that reaches the same loss with fewer tokens, or in less wallclock time.
We discover that Muon accomplishes both.
What Muon Does Differently
Muon is a lightweight second-order optimizer that can be seen as a special case of Shampoo under certain assumptions [shampoo-reduction]. It approximates second-order information without storing or inverting large matrices, making it simpler to implement and cheaper to run. It uses a Newton-Schulz iteration to compute its update direction and maintains only first-moment (momentum) state, making it even leaner than AdamW. The result is a scalable optimizer that remains efficient at large batch sizes.
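For intuition, here is a minimal PyTorch-style sketch of a Muon-like step on a single 2D weight matrix. It is illustrative rather than our exact training code: the quintic Newton-Schulz coefficients and step count follow common open-source Muon implementations, and names such as `muon_step` and its hyperparameter defaults are placeholders.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix via a quintic
    Newton-Schulz iteration (coefficients as in common open-source Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.0):
    """One sketched Muon update for a 2D weight: momentum on the raw gradient,
    then an orthogonalized step. Only a first-moment buffer is kept."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    if weight_decay:
        param.mul_(1 - lr * weight_decay)   # decoupled weight decay
    param.add_(update, alpha=-lr)
```

Because the only per-parameter state is the momentum buffer, the optimizer memory footprint is roughly half that of AdamW's two moment estimates.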
Key Results
1. Better Compute-Time Tradeoffs
We trained decoder-style transformer models (100M to 4B params) on Python code and general web data (DCLM). Across all settings, Muon reached target losses faster and with fewer tokens than AdamW.
2. Data Efficiency at Scale
At batch sizes up to 16M tokens, Muon needed 10–15% fewer tokens than AdamW to reach the same loss. The relative advantage persists and often grows with batch size.
3. muP Works with Muon
We used muP to transfer hyperparameters from small models to a 3.7B model (sequence length 8192), and the transfer held for both learning rate and weight decay. We used a "Telescoping" sweep that narrows the search space as width grows (see the sketch below). This approach heuristically controls for search error and avoids running very expensive full-blown sweeps at large model sizes.
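The sketch below conveys the idea of a telescoping sweep under muP: search a learning-rate grid at a narrow width, then shrink the grid around the best value before moving to the next, wider model. The function names (`telescoping_sweep`, `train_and_eval`) and the `shrink`/`points` parameters are hypothetical; the exact procedure in the paper may differ.

```python
import numpy as np

def telescoping_sweep(widths, lr_lo, lr_hi, train_and_eval, shrink=0.5, points=5):
    """Sweep learning rates across increasing widths, narrowing the grid each time.
    `train_and_eval(width, lr)` is a placeholder that returns a validation loss."""
    best_lr = None
    for width in sorted(widths):
        lrs = np.geomspace(lr_lo, lr_hi, num=points)
        losses = {lr: train_and_eval(width=width, lr=lr) for lr in lrs}
        best_lr = min(losses, key=losses.get)
        # Shrink the log-scale search window around the current optimum before
        # moving to the next, wider model (muP keeps the optimum roughly stable).
        half_ratio = (lr_hi / lr_lo) ** (shrink / 2)
        lr_lo, lr_hi = best_lr / half_ratio, best_lr * half_ratio
    return best_lr
```

The payoff is that the most expensive (widest) runs only ever see a small, already-centered grid rather than the full search space.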
Summary
Muon beats AdamW on the compute-time tradeoff across target losses and batch sizes
muP works with Muon, allowing scalable hyperparameter transfer. Our contribution, the Telescoping sweep, makes transfer tractable.
Together, they form a practical recipe for large-scale pretraining.
Resources