choosing a loss function

Jan 22, 2026

Quick recap up to now:

We’re trying to do language modeling. We know that we’re converting text samples into elements in \(\mathbb{R}^n\), and operating on them with differentiable functions (MLP layers for now), so that our optimizer machinery works.

What’s next?

Well, we want our network to take in some chunk of text, \([t_1, t_2, \dots, t_n]\), and predict the most likely next piece of text \(t_{n+1}\).

What this means is that our model output should be a vector representing the probability of each possible chunk appearing next, given the input sequence. It’s kind of the only reasonable way to make this work.

For example, given the input “the quick brown fox jumps over the lazy”, we want our model to output something like:

\[ \begin{bmatrix} \text{rabbit} \\ \text{fox} \\ \text{green} \\ \text{eliezer} \\ \text{dog} \\ \text{the} \\ \text{SPACE} \end{bmatrix} \to \begin{bmatrix} 0.02 \\ 0.05 \\ 0.01 \\ 0.00 \\ 0.82 \\ 0.03 \\ 0.07 \end{bmatrix} \]

probability conversion

OK first we have a problem. You’ll note that above the probabilities sum to one, and each probability is in the range \((0,1)\). However, we have zero guarantee (and we should not expect) that the network will magically output a vector containing such a valid probability distribution.

So first we need to pass it through a (differentiable) function that fixes this.

First we need to fix the negatives. We probably also want a monotonically increasing function, so that larger raw outputs still map to larger probabilities. The obvious choice is the exponential function.

But then of course we don’t have a valid “summing-to-one” vector and the components can be arbitrarily large.

And again the obvious fix is just to normalize it, i.e. divide each component by the sum of all the components. Putting it all together we have:

\[ \mathrm{validated}(x)_k := \frac{e^{x_k}}{\sum_{i=1}^n e^{x_i}} \]

And we don’t call it the ‘validated’ function, we call it ‘softmax’. It’s the obvious “make-it-a-real-probability-distribution” function.
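To make this concrete, here’s a minimal NumPy sketch that transcribes the formula directly (the logits are made up for illustration; numerical stability comes up later):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Turn an arbitrary real vector into a valid probability distribution."""
    exps = np.exp(x)          # fixes the negatives, monotonically
    return exps / exps.sum()  # normalizes so the components sum to one

logits = np.array([2.0, -1.0, 0.5])
probs = softmax(logits)
print(probs)        # every component in (0, 1)
print(probs.sum())  # 1.0
```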

loss function choices

Great ok so now we have a nice differentiable function we can throw on the end of the model to end up with a real probability distribution.

Now we need a measure of “how close to the real distribution is this?”

First, note that when we train this on text samples, the “real distribution” for any given piece of text will just be a “one-hot” vector, or a standard basis vector. As far as the training set is concerned, this particular piece of text ended in this particular way with probability 1, because we actually observed it.

I think you could probably formulate it another way, but honestly this really is the most straightforward, so I’m going to hope you just buy this.

So the “true distribution” might look like this:

\[ \begin{bmatrix} \text{rabbit} \\ \text{fox} \\ \text{green} \\ \text{eliezer} \\ \text{dog} \\ \text{the} \\ \text{SPACE} \end{bmatrix} \to \begin{bmatrix} 0.00 \\ 0.00 \\ 0.00 \\ 0.00 \\ 1.00 \\ 0.00 \\ 0.00 \end{bmatrix} \]

Let’s call our model prediction \(\hat{y}\) and our true reference output \(y\), both in \(\mathbb{R}^n\).

Recall that we’re trying to minimize loss, so we want to assign large numbers to bad predictions and small (or large, negative) numbers to good predictions.

Immediately we have a strong candidate: the naive dot product, negated, i.e. \(l(\hat{y}, y) = - \langle \hat{y}, y \rangle\).

I say ‘immediately’ because the geometric intuition is solid. If our \(\hat{y}\) is totally orthogonal to \(y\) then we get a zero. If \(\hat{y} = y\) (a perfect one-hot prediction) then we get \(-1\). Great.

Consider the following three guesses, \(a\), \(b\), and \(c\), against an observation \(y\):

\[ \begin{aligned} y &= [1, 0] \\[0.5em] a &= [0.1, 0.9] \implies \ell = -0.1 \\ b &= [0.5, 0.5] \implies \ell = -0.5 \\ c &= [0.9, 0.1] \implies \ell = -0.9 \end{aligned} \]
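A quick sanity check of those numbers, as a trivial sketch:

```python
import numpy as np

def neg_dot_loss(y_hat: np.ndarray, y: np.ndarray) -> float:
    return -float(np.dot(y_hat, y))

y = np.array([1.0, 0.0])
for name, guess in [("a", [0.1, 0.9]), ("b", [0.5, 0.5]), ("c", [0.9, 0.1])]:
    print(name, neg_dot_loss(np.array(guess), y))  # -0.1, -0.5, -0.9
```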

So it looks like this function works as a loss function for our problem.

The thing is, no one uses it. Why?

cross entropy loss

The version of this “multi-class classification” loss function you’ll actually see is this:

\[ l(\hat{y}, y) = -\langle \log(\hat{y}), y \rangle = -\sum_{i=1}^n \log(\hat{y}_i) y_i \]

The function is known as cross-entropy loss (CEL).

Note this simplifies to \(\mathrm{CEL}(\hat{y}, y) = -\log(\hat{y}[\mathrm{true index}])\) in the one-hot case.
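Here’s a small sketch showing the general form and the one-hot shortcut agreeing (the three-class distribution is made up for illustration):

```python
import numpy as np

def cross_entropy(y_hat: np.ndarray, y: np.ndarray) -> float:
    """General form: -<log(y_hat), y>."""
    return -float(np.sum(y * np.log(y_hat)))

y_hat = np.array([0.1, 0.2, 0.7])  # model's predicted distribution
y = np.array([0.0, 0.0, 1.0])      # one-hot truth: index 2 is what we observed

print(cross_entropy(y_hat, y))  # ~0.357
print(-np.log(y_hat[2]))        # identical: the one-hot shortcut
```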

So why the log? A few reasons.

First: theoretical soundness.

There’s this notion of “entropy” as a quantity which (roughly speaking) measures the uncertainty in a probability distribution. Higher entropy = more bits required, on average, to encode samples drawn from the distribution.

It’s written \(H(p) = - \sum_{i=1}^n p_i \log(p_i)\).

Basically if we just use cross-entropy we get a nice theoretical interpretation of what we’re doing.

The big payoff there is that we can say that minimizing cross-entropy loss is the same as doing MLE for categorical distributions, and that minimizing cross-entropy is equivalent to minimizing KL divergence to the true distribution: cross-entropy decomposes as \(\mathrm{CE}(p, q) = H(p) + \mathrm{KL}(p \,\|\, q)\), and \(H(p)\) doesn’t depend on the model.
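A tiny numerical check of that decomposition, with arbitrary example distributions:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model's distribution

H = -np.sum(p * np.log(p))      # entropy of p
KL = np.sum(p * np.log(p / q))  # KL divergence from p to q
CE = -np.sum(p * np.log(q))     # cross-entropy

print(CE, H + KL)  # the two agree: CE(p, q) = H(p) + KL(p || q)
```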

There’s a wealth of resources out there to learn more about the theoretical justifications here. Shannon’s original information theory paper is very digestible and worth reading.

For the sake of this blog post I’m not going to dwell on these facts now, although it definitely is worth a followup at some point. I don’t want to undersell the importance of this theoretical connection, but I think we can reach a minimum viable justification of why we use cross-entropy without a statistical digression at this point in the series.

Second: practical considerations.

Consider the same scenario, but with our new cross-entropy loss function:

\[ \begin{aligned} y &= [1, 0] \\[0.5em] a &= [0.1, 0.9] \implies \ell = 2.30 \\ b &= [0.5, 0.5] \implies \ell = 0.69 \\ c &= [0.9, 0.1] \implies \ell = 0.11 \end{aligned} \]

Let’s compare this to earlier, using the confidently correct case (c) as a baseline.

negative dot product loss penalty for being:

- unsure (b): 0.4 worse than c
- confidently wrong (a): 0.8 worse than c

cross-entropy loss penalty for being:

- unsure (b): 0.58 worse than c (roughly 6× c’s loss)
- confidently wrong (a): 2.19 worse than c (roughly 21× c’s loss)

Clearly introducing the logarithm has made our loss landscape far more expressive. The hope is that this will give our network a stronger gradient signal when we’re confidently wrong, which is when it’s most needed.

It’s also worth mentioning that exponentiating in the softmax and then immediately taking a logarithm in the loss is a structure that popular ML frameworks like PyTorch exploit to improve numerical stability.
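For instance, PyTorch’s `F.cross_entropy` takes raw logits and does the log-softmax internally in a numerically stable way. A sketch with made-up logits:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0, 0.5]])  # raw model outputs, pre-softmax
target = torch.tensor([0])                 # index of the true class

# naive: softmax, then log, then pick out the true class
naive = -torch.log(torch.softmax(logits, dim=-1))[0, 0]

# fused: log-softmax happens internally, avoiding overflow on extreme logits
fused = F.cross_entropy(logits, target)

print(naive.item(), fused.item())  # same value
```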

If you don’t buy the theoretical reasons, the practical considerations are probably good enough to accept that CEL is the right loss function to choose.

perplexity

One annoying thing about the loss function is that it’s hard to interpret. Technically it isn’t unitless, we could say “nats” (or “bits”, with log base 2), but the raw number doesn’t mean much on its own.

The way out of this is to use a measure called perplexity, defined as:

\[ \mathrm{perplexity} = e^{\mathrm{CEL}} = e^{-\log(\hat{y}_i)} = \hat{y}_i^{-1} \]

Where \(i\) is the index of the true label. The interpretation is that a perplexity of e.g. 10 means your model is, on average, as uncertain as if it were choosing uniformly among 10 options. It’s more meaningful in the non-degenerate, non-one-hot case, but whatever. It’s the number you want to go down when you stack a bunch of layers and hit model.train.
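For example, using the 0.82 the model assigned to “dog” earlier:

```python
import numpy as np

# one-hot case: perplexity = exp(CEL) = 1 / p(true token)
p_true = 0.82                   # probability assigned to the observed token
cel = -np.log(p_true)
print(np.exp(cel))              # 1.2195... == 1 / 0.82

# a uniform guess over 10 options lands at perplexity 10
print(np.exp(-np.log(1 / 10)))  # 10.0
```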

next steps

Ok! At this point we have some basic procedure for learning established, we’ve got our optimization technique of choice, have a reasonable functional form for our first models, and now a loss function.

Next up, it’s time to test this stuff in practice!