Muon, muP, and the Compute-Time Tradeoff
Essential AI • Optimization
May 12, 2025
Adam and AdamW are vital ingredients in the pretraining recipe, dominating the landscape of neural network optimizers. Recently, Muon, a surprisingly simple second-order optimizer, has emerged as a potential alternative. In our [paper], we ask two questions:
Is Muon a robust replacement for AdamW?
Does Muon work well with muP, the maximal update parameterization?
We demonstrate that Muon [Bernstein, Keller, Moonshot] achieves better compute-time tradeoffs than AdamW, especially at large batch sizes. It readily pairs with maximal update parameterization (muP), a lightweight hyperparameter tuning strategy, delivering easy efficiency wins for pretraining LLMs.
Why This Matters
Ultimately, what matters most for a pretraining workload is how much time and compute it takes to reach your target loss with the resources at your disposal. All else being equal, you would prefer an optimizer that reaches the same loss with fewer tokens, or in less wallclock time.
We discover that Muon accomplishes both.
What Muon Does Differently
Muon is a lightweight second-order optimizer that can be seen as a special case of Shampoo under certain assumptions [shampoo-reduction]. It approximates second-order information without storing or inverting large matrices, making it simpler to implement and cheaper to run. It uses a Newton-Schulz iteration to compute its update direction and maintains only first-moment (momentum) state, making it even leaner than AdamW. The result is a scalable optimizer that remains efficient at large batch sizes.
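For intuition, here is a minimal PyTorch-style sketch of a Muon-like step on a single 2D weight matrix. It is illustrative rather than our exact training code: the quintic Newton-Schulz coefficients and step count follow common open-source Muon implementations, and names such as `muon_step` and its hyperparameter defaults are placeholders.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map G to the nearest semi-orthogonal matrix via a quintic
    Newton-Schulz iteration (coefficients as in common open-source Muon code)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)          # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # work with the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95, weight_decay=0.0):
    """One sketched Muon update for a 2D weight: momentum on the raw gradient,
    then an orthogonalized step. Only a first-moment buffer is kept."""
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    if weight_decay:
        param.mul_(1 - lr * weight_decay)   # decoupled weight decay
    param.add_(update, alpha=-lr)
```

Because the only per-parameter state is the momentum buffer, the optimizer memory footprint is roughly half that of AdamW's two moment estimates.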
Key Results
1. Better Compute-Time Tradeoffs
We trained decoder-style transformer models (100M to 4B params) on Python code and general web data (DCLM). Across all settings, Muon reached target losses faster and with fewer tokens than AdamW.
2. Data Efficiency at Scale
At batch sizes up to 16M tokens, Muon needed 10–15% fewer tokens than AdamW to reach the same loss. The relative advantage persists and often grows with batch size.
3. muP Works with Muon
We used muP to transfer hyperparameters from small models to a 3.7B model (sequence length 8192), and the transfer held for both learning rate and weight decay. We used a "Telescoping" sweep that narrows the search space as width grows (see the sketch below). This approach heuristically controls for search error and avoids running very expensive full-blown sweeps at large model sizes.
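The sketch below conveys the idea of a telescoping sweep under muP: search a learning-rate grid at a narrow width, then shrink the grid around the best value before moving to the next, wider model. The function names (`telescoping_sweep`, `train_and_eval`) and the `shrink`/`points` parameters are hypothetical; the exact procedure in the paper may differ.

```python
import numpy as np

def telescoping_sweep(widths, lr_lo, lr_hi, train_and_eval, shrink=0.5, points=5):
    """Sweep learning rates across increasing widths, narrowing the grid each time.
    `train_and_eval(width, lr)` is a placeholder that returns a validation loss."""
    best_lr = None
    for width in sorted(widths):
        lrs = np.geomspace(lr_lo, lr_hi, num=points)
        losses = {lr: train_and_eval(width=width, lr=lr) for lr in lrs}
        best_lr = min(losses, key=losses.get)
        # Shrink the log-scale search window around the current optimum before
        # moving to the next, wider model (muP keeps the optimum roughly stable).
        half_ratio = (lr_hi / lr_lo) ** (shrink / 2)
        lr_lo, lr_hi = best_lr / half_ratio, best_lr * half_ratio
    return best_lr
```

The payoff is that the most expensive (widest) runs only ever see a small, already-centered grid rather than the full search space.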
Summary
Muon beats AdamW on the compute-time tradeoff across target losses and batch sizes
muP works with Muon, allowing scalable hyperparameter transfer. Our contribution, the Telescoping sweep, makes transfer tractable.
Together, they form a practical recipe for large-scale pretraining.
Resources