ml_0: intro theory

Jan 6, 2026

This is the first post in a series about LLM and diffusion model math. I’m writing this because I think it’s valuable for me; I tend to learn well when I’m forced to explain things. Hopefully others will also find this useful or at least interesting.

outline

I plan to gradually build up to modern transformer-based language models, starting from barebones assumptions. From there, we can implement the simplest models possible and work up to modern transformers; hopefully this way, each change in architecture feels somewhat motivated. Something like:

char-level MLP -> scaling up training set -> use AdamW -> regularization -> attention

This way, we get that nice poset-y knowledge-dependency graph, but without having to take classical ML digressions along the way (SVMs, kernel methods, Bayes nets, RNNs, etc.). These things are definitely useful and nice background, but we're basically trying to speedrun our way to LLMs here. OK, enough talk.

specifying the end goal

I’ll take for granted that the end goal of this series is to implement our own LLM (perhaps just an LM, since I’m quite GPU poor).

What does that mean? Well, we want a function (fine, program) that takes as input some natural language, and spits out other (useful, interesting, “good”) natural language.

The approach that’s worked well so far is roughly:

  1. train a model to accurately predict internet text
  2. tweak the model to produce helpful, honest, harmless text (RLHF)

For now we’ll just concern ourselves with the first part, so we have a clear goal: accurately predict internet text.

formalizing the end goal

So we want to learn a distribution over text. How do we even approach this?

We have some domain \(Z\) (text sequences), some unknown true distribution \(\mathcal{D}\) over \(Z\) (the internet), a training set \(S\) of samples from \(\mathcal{D}\), a set of candidate models \(\mathcal{H}\) (our hypothesis class), and a loss function \(l\) that measures how bad our predictions are.

We want an algorithm that takes our training set and returns a good model:

\[ A(S) \in \underset{h \in \mathcal{H}}{\mathop{\mathrm{arg\,min}}}\; \mathbb{E}_{z \sim \mathcal{D}}[l(h, z)] \]

In other words: give us the model that minimizes expected loss over the true distribution.

But look carefully: we’ve made a fatal error. We don’t know \(\mathcal{D}\)! We can’t compute expected loss over a distribution we don’t have access to.

This leaves us with literally one move: minimize loss over the sample instead.

\[ A(S) \in \underset{h \in \mathcal{H}}{\mathop{\mathrm{arg\,min}}}\;{L}_S(h) \]

where \({L}_S(h) := \frac{1}{m} \sum_{i=1}^m l(h, z_i)\) is just the average loss on our training data.

This procedure is called Empirical Risk Minimization (ERM), and it forms the basis for virtually every ML algorithm out there. We’ve derived it from having no other option.
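To make the recipe concrete, here’s a toy ERM sketch. Everything in it (the coin-flip domain, the grid of constant predictors, squared-error loss) is my own hypothetical instantiation, not something this series commits to:

```python
import random

random.seed(0)

# Toy instantiation of ERM: Z = {0, 1} (coin flips), the "true distribution"
# D is a coin with P(1) = 0.7, H is the set of constant predictors
# {0.0, 0.1, ..., 1.0}, and l is squared error.
true_p = 0.7
S = [1 if random.random() < true_p else 0 for _ in range(1000)]  # samples from D
H = [i / 10 for i in range(11)]

def empirical_risk(h, S):
    # L_S(h) = (1/m) * sum_i l(h, z_i), with squared-error loss
    return sum((h - z) ** 2 for z in S) / len(S)

# A(S): return the hypothesis with the smallest average training loss
h_star = min(H, key=lambda h: empirical_risk(h, S))
print(h_star)  # lands near the true bias of 0.7
```

Note that even this silly example has all five ingredients: a domain, an (unknown-in-principle) distribution, a sample, a hypothesis class, and a loss.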

tldr, the name of the game is:

  1. choose a hypothesis class (~model architecture)
  2. choose a training set
  3. choose a loss function
  4. pick the hypothesis that minimizes the loss on the training set

picking model architectures

So we minimize loss on the training set. But if we search over all possible functions, we can always find one that perfectly memorizes the training data:

\[ h_{memorize}(x) = \begin{cases} y_i & \text{if } \exists (x_i, y_i) \in S \text{ such that } x = x_i \\ \text{guess} & \text{otherwise} \end{cases} \]

This achieves zero training loss but is useless in the real world. If, say, the labels are fair coin flips and most inputs never appear in our training set, this hypothesis gets 100% training accuracy but only ~50% test accuracy (no better than chance).

We call this overfitting: doing well on training data but badly on new data.
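Overfitting via memorization is easy to demonstrate. A hypothetical setup where labels are fair coin flips, so nothing can generalize:

```python
import random

random.seed(0)

# Hypothetical setup: inputs are integers, labels are fair coin flips,
# so no hypothesis can beat 50% accuracy on unseen data.
train = {x: random.randint(0, 1) for x in range(100)}
test = {x: random.randint(0, 1) for x in range(100, 10100)}

def h_memorize(x):
    # Return the memorized label if x was seen in training, else guess.
    return train[x] if x in train else random.randint(0, 1)

train_acc = sum(h_memorize(x) == y for x, y in train.items()) / len(train)
test_acc = sum(h_memorize(x) == y for x, y in test.items()) / len(test)
print(train_acc)  # 1.0: perfect memorization of the training set
print(test_acc)   # ~0.5: no better than chance on new data
```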

We’ll come back to this issue later; for now, just keep it in the back of your mind.

training/validation/test splits

In practice, we don’t literally just minimize loss over all available data. Since we’ll use expressive models that risk overfitting, we split the data into a training set (what we minimize loss over), a validation set (held out to tune hyperparameters and catch overfitting), and a test set (touched only once at the end, for an unbiased estimate of performance).

For small experiments I usually skip the test set, but the train/val split is essential.
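A minimal sketch of how such a split might look (the 80/10/10 fractions are just illustrative defaults, not a recommendation):

```python
import random

random.seed(0)

# Shuffle once, then carve off validation and (optionally) test slices.
data = list(range(1000))  # stand-in for a dataset of examples
random.shuffle(data)

n_val = int(0.1 * len(data))
n_test = int(0.1 * len(data))

val = data[:n_val]
test = data[n_val:n_val + n_test]
train = data[n_val + n_test:]

print(len(train), len(val), len(test))  # 800 100 100
```

The important property is that the three slices are disjoint: any example that influences training (or hyperparameter tuning) can no longer give an honest estimate of generalization.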

choosing a loss function

We need to choose a function that assigns small numbers to good model outputs and big numbers to bad model outputs.

However, as you may imagine, many such functions would fit the bill.

Hence, we’ll return to this question once we have some idea of the mathematical properties which would be desirable in such a loss function. See part 2 in this series once it’s written.

recap

The basic idea is to minimize a loss function over a training set that’s representative of the distribution we want to model, searching within some well-chosen hypothesis class.

In other words, we’ve reduced the problem of modeling language to an optimization problem. The next post will discuss how we might go about optimizing such models, and the sorts of model architectures which are amenable to these methods.