AdamW

Mar 1, 2026

The last post left off with a snippet of a training log, and we posed the question: “what’s wrong with our current setup?”

Look at the following two log chunks:

embedding.weight                    g=3.1e-05 w=1.0e+00 u=1.6e-06
blocks.0.attention.heads.0.W_Q      g=5.5e-05 w=5.8e-02 u=4.8e-05
blocks.0.attention.heads.0.W_K      g=5.4e-05 w=5.9e-02 u=4.6e-05
...

blocks.1.mlp.net.2.bias             g=1.2e-03 w=1.3e-02 u=4.5e-03
blocks.1.mlp.net.4.weight           g=1.0e-03 w=2.8e-02 u=1.9e-03
...
blocks.3.mlp.net.4.weight           g=7.2e-04 w=2.9e-02 u=1.2e-03
blocks.3.mlp.net.4.bias             g=1.3e-03 w=2.1e-02 u=3.2e-03
blocks.3.rms1.gamma                 g=2.0e-03 w=8.7e-01 u=1.1e-04
Sample generations (temperature = 0.0):

PROMPT:  'pleasure contrary to his Reason, the former feels but does not yield to it. Like again are the man of Imperfect'
OUTPUT:  'ly punish’d. Theirs are not to be found. Their figure is to be found. Theirs are not to be found. Theirs are not to be found. Theirs are not to be found. Their population is not to be found in the same way. The same is the same with which the same is the same, and the same is the same with the same, and the same is'

fixing sgd: Adam

The first issue is that gradient norms vary by orders of magnitude across parameters. Gradients to our first attention layer are over an order of magnitude smaller than the gradients to the first MLP.

Is this a problem? Well, stochastic gradient descent assigns update magnitudes based on the immediate loss landscape, not based on the distance to the optimum. So a smaller gradient doesn’t mean “we’re done”; it just means “the loss landscape around this parameter is locally flat”. So yeah, it could definitely be a problem.

Think about it: the update applied to each parameter is some constant fraction of the computed gradient. Recall the SGD formulation: \(\theta_t = \theta_{t-1} - \eta \nabla_\theta f(x,\theta_{t-1})\)

Imagine the optimal value of \(\theta\) sits at the bottom of a very steep hill. Then the gradient magnitudes are going to remain large, and the actual computed \(\theta\) is going to bounce around wildly, despite being “close” to the optimum in the context of the overall loss landscape.

On the other hand, if \(\theta\) is “far away” from the optimum but the local loss topography is “shallow” (i.e. the gradient magnitude is small), then it’s going to take forever for \(\theta\) to approach the optimum.

So currently we take steps proportional to the steepness of the local loss landscape. But it looks like what we want is to take steps proportional to how far away from the optimum we are. And from that last example, we already have some intuition for how we might go about doing this:

In pseudocode, we’d write something like:

def get_gradient_scale(gradient_history):
    if are_generally_pointing_in_the_same_direction_for_a_while(gradient_history):
        return BIG_STEPSIZE   # high confidence we know where to go, can speed up
    else:
        return SMALL_STEPSIZE # gradients change wildly, uncertain or close to optimum 

We just have to figure out how to formalize this. The first question is like … what is BIG_STEPSIZE and what is SMALL_STEPSIZE?

The nice thing about vanilla SGD is that we learn this from the function itself. It’s a method that works (subject to ‘oops, stuck in local minima’) over the massive class of real-valued differentiable functions.

That is, it works for functions like this:

\(f(x,y) = (x - 0.01)^2 + (y - 0.02)^2\) (optimum at small values)

and for functions like this:

\(f(x,y) = (x - 100)^2 + (y - 200)^2\) (optimum at large values)

The gradient at distance \(d\) from optimum is \(2d\), so it scales naturally. SGD with lr=0.1 works for both if you init reasonably.
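To see this concretely, here’s a quick numeric check (my own sketch, not from the post): plain gradient descent with a fixed learning rate handles both quadratics, because the gradient \(2(x - c)\) scales with the distance to the optimum \(c\).

```python
# Illustrative sketch: gradient descent on f(x) = (x - c)^2 with the same lr
# works whether the optimum c is tiny or huge, because the gradient 2*(x - c)
# scales with the distance to the optimum.
def descend(c, x0=0.0, lr=0.1, steps=100):
    """Minimize f(x) = (x - c)^2 starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * (x - c)  # gradient of (x - c)^2
    return x

print(descend(c=0.01))   # -> ~0.01
print(descend(c=100.0))  # -> ~100.0, in the same number of steps
```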

But we don’t care about general functions here! We’re training a very particular class of neural network. And we’ve empirically observed in the training logs that all of our parameters are near the \(10^{-2}\) range.

So why not simply hardcode some reasonable base stepsize \(\alpha := 0.001\)?

Then for each (scalar) parameter, the update rule is just:

\[\theta_t = \theta_{t-1} - \alpha \cdot \mathrm{scale}(\nabla_1,\nabla_2, \dots, \nabla_{t-1})\]

Where \(\mathrm{scale}\) is some function of the gradient history, where:

\(\mathrm{scale} \approx 1\) when we’re taking ‘confident’ steps towards \(\infty\),

\(\mathrm{scale} \approx -1\) when we’re taking ‘confident’ steps towards \(-\infty\),

and \(\mathrm{scale} \approx 0\) when we’re taking ‘uncertain’ steps in either direction.

Admittedly this notation may be somewhat confusing. Here’s a sketch:

[figure: noise-to-signal ratio of the gradient history]

The big question is just like … how do we mathematically express this magic \(\mathrm{scale}\) term as a function of the gradient history?

The first thing to do is figure out a way to determine how much directional noise we’re experiencing. A simple approach would just be to take the moving average. Entries in the gradient history which are additive inverses of one another cancel themselves out and push the number closer to zero, which is a good starting point in our search for a suitable expression.

Let \(g_t\) be the scalar gradient value of some entry in a weight matrix at step \(t\), and \(\mathrm{EMA}\) be the exponential moving average (an exponential average is cheaper to compute and more stable than a simple sliding-window moving average).

then: \[\mathrm{scale} = \mathrm{EMA}(g_t, g_{t-1}, \dots, g_1)\]

From there, we can normalize to the range \([-1,1]\) by dividing and taking care of the signs:

\[\mathrm{scale} = \frac{\mathrm{EMA}(g_t, g_{t-1}, \dots, g_1)} {\sqrt{\mathrm{EMA}(g_t^2, g_{t-1}^2, \dots, g_1^2)}}\]

Let’s pause. If the gradient history (recall we’re looking at one entry in a vector, so these are scalars) looks like [1,-1,1,-1,...], then the EMA on the top is near 0 (given many entries and a reasonably large EMA \(\beta\) parameter, i.e. a long averaging window), and the bottom is 1. If the gradient history is [1,1,1,1,...], the numerator is 1 and the denominator is 1. Good.
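As a sanity check on that intuition, here’s a small sketch (the `scale` helper name is mine) computing the ratio over both histories with the EMA recurrence \(m \leftarrow \beta m + (1-\beta) g\):

```python
# Sketch: scale = EMA(g) / sqrt(EMA(g^2)) for a scalar gradient history.
def scale(history, beta=0.9, eps=1e-8):
    m = v = 0.0
    for g in history:
        m = beta * m + (1 - beta) * g      # EMA of gradients
        v = beta * v + (1 - beta) * g * g  # EMA of squared gradients
    return m / (v ** 0.5 + eps)

print(scale([1, -1] * 50))  # near 0: alternating directions cancel
print(scale([1] * 100))     # near 1: consistent direction
```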

Let’s formalize it a bit and see if anything sticks out:

Let \(\alpha, \beta \in \mathbb{R}\) with \(\alpha > 0\) and \(\beta \in (0,1)\), and let \(\epsilon\) be a small constant for numerical stability. Set \(m_0, v_0 = 0\).

Loop over \(t\) until satisfied: \[ \begin{align} g_t &\leftarrow \nabla_\theta f_t(\theta_{t-1})\\ m_t &\leftarrow \beta \cdot m_{t-1} + (1-\beta) \cdot g_t\\ v_t &\leftarrow \beta \cdot v_{t-1} + (1-\beta) \cdot g_t^2\\ \theta_t &\leftarrow \theta_{t-1} - \alpha \cdot m_t / (\sqrt{v_t} + \epsilon)\\ \end{align} \]

Note what we’re doing. We’re computing the first moment (mean) on the top and the (uncentered) second moment on the bottom. We’re also computing moving averages, so these moments will be biased towards zero at the start. We can correct for this (don’t worry too much about this part):

\[ \begin{align} g_t &\leftarrow \nabla_\theta f_t(\theta_{t-1})\\ m_t &\leftarrow \beta \cdot m_{t-1} + (1-\beta) \cdot g_t\\ v_t &\leftarrow \beta \cdot v_{t-1} + (1-\beta) \cdot g_t^2\\ \hat{m_t} &\leftarrow m_t / (1 - \beta^t)\\ \hat{v_t} &\leftarrow v_t / (1 - \beta^t)\\ \theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \hat{m_t} / (\sqrt{\hat{v_t}} + \epsilon)\\ \end{align} \]
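A tiny numeric illustration of why the correction matters (my own sketch): with a constant gradient \(g = 1\) and \(\beta = 0.999\), the raw EMA crawls up from zero, while the corrected estimate is 1 from the very first step.

```python
# With constant g = 1, m_t = 1 - beta^t exactly, so dividing by (1 - beta^t)
# recovers the true mean of 1 immediately.
beta = 0.999
m = 0.0
for t in range(1, 6):
    m = beta * m + (1 - beta) * 1.0  # raw EMA: badly biased toward 0 early on
    m_hat = m / (1 - beta ** t)      # bias-corrected estimate
    print(t, round(m, 6), round(m_hat, 6))
```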

But wait: why use the same \(\beta\) for both \(m\) and \(v\)? They’re doing different jobs.

\(m\) tracks direction - “where are we heading right now?” This should adapt relatively quickly. If the loss landscape changes, we want to notice within like ~10 steps.

\(v\) tracks scale - “what’s the typical magnitude of gradients for this parameter?” This should be stable. If the denominator \(\sqrt{v}\) fluctuates a lot, our effective learning rate becomes erratic. We want this to change slowly over training.

So we use a lower \(\beta_1 = 0.9\) for momentum (effective window ~10 steps, more responsive) and a higher \(\beta_2 = 0.999\) for the scale estimate (effective window ~1000 steps, very stable).
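A quick check on those “effective window” numbers (my own sketch): an EMA with decay \(\beta\) weights a gradient from \(k\) steps ago by \((1-\beta)\beta^k\), and that weight decays by a factor of roughly \(1/e\) after about \(1/(1-\beta)\) steps.

```python
# The relative weight of a gradient from 1/(1-beta) steps ago is
# beta^(1/(1-beta)), which is close to 1/e (~0.37) for both betas --
# hence "window ~10" for beta=0.9 and "~1000" for beta=0.999.
for beta in (0.9, 0.999):
    window = round(1 / (1 - beta))
    print(beta, window, round(beta ** window, 4))
```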

Set \(\beta_1 = 0.9\), \(\beta_2 = 0.999\) and do:

\[ \begin{align} g_t &\leftarrow \nabla_\theta f_t(\theta_{t-1})\\ m_t &\leftarrow \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t\\ v_t &\leftarrow \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2\\ \hat{m_t} &\leftarrow m_t / (1 - \beta_1^t)\\ \hat{v_t} &\leftarrow v_t / (1 - \beta_2^t)\\ \theta_t &\leftarrow \theta_{t-1} - \alpha \cdot \hat{m_t} / (\sqrt{\hat{v_t}} + \epsilon)\\ \end{align} \]

And that’s the Adam optimizer! Note that these moving averages are cheap to compute, and give us full per-parameter adaptive learning rates.
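For intuition, the update loop above can be written out in a few lines of plain Python. This is an illustrative sketch on a toy 1-D quadratic, not the pytorch implementation:

```python
# From-scratch Adam minimizing f(theta) = (theta - 3)^2.
# grad_fn returns the gradient at the current theta.
def adam(grad_fn, theta, alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = v = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g      # first moment: direction
        v = beta2 * v + (1 - beta2) * g * g  # second moment: scale
        m_hat = m / (1 - beta1 ** t)         # bias corrections
        v_hat = v / (1 - beta2 ** t)
        theta -= alpha * m_hat / (v_hat ** 0.5 + eps)
    return theta

print(adam(lambda x: 2 * (x - 3.0), theta=0.0))  # approaches 3.0
```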

Let’s try it out! We’ll start with the same learning rate we used for SGD. Conveniently, in pytorch this is a one-line change.

optimizer = optim.Adam(params=model.parameters(), lr=0.05)

empirical results

Let’s start training!

adam | INFO | 2026-03-01 16:58:54 | starting training:
n_params=23.1M, n_batches=3548, batch_size=4096, context_length=33, tokens_per_batch=135.2K, total_tokens=1.4B
Traceback (most recent call last):
  File "/home/anon/projects/ml2/ml9/adam.py", line 509, in <module>
    main()
  File "/home/anon/projects/ml2/ml9/adam.py", line 506, in main
    train(training_state, logger)
  File "/home/anon/projects/ml2/ml9/adam.py", line 439, in train
    state.log_batch(batch_no + 1, logger)
  File "/home/anon/projects/ml2/ml9/adam.py", line 330, in log_batch
    'train_ppl': round(math.exp(self.train_loss), 3),
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
OverflowError: math range error

Hmmm. Yeah that’s not going to work. Let’s try the default lr=1e-3.