Jan 27, 2026
Last time we trained a simple, naive character-level MLP on the complete works of William Shakespeare.
We were able to get the validation perplexity down to 4.7 after ‘seeing’ about 600M characters.
Sample generations looked like:
PROMPT: ' oaths in one,\nI and the justice of my love would make thee\nA confessed traitor! O thou most perfidious\nThat ever gently looked,'
OUTPUT: ' to my father’s day.\n\nBENEDICK.\nBelieve me not seen a sport and do you?\n\nKING HENRY.\nWhat are you to do. I had by the sent your land,\nI will not be more to do in a fair and man to any man so much as w'
Not horrible, and we were beating the n-gram baseline. But we were overfitting after looping over the same dataset about 3 times. Now the first task is to see what happens when we scale up.
Recall from ml1 that a broader hypothesis class needs more samples to generalize, so our 200M parameter model on 5MB was almost certainly data-starved.
To un-starve the model, I selected a few high-quality titles from Project Gutenberg and compiled a new dataset. It’s about 50MB, or about 10x larger than the 5MB Shakespeare file.
The network config looks like this:
ctx_len = 128
emb_dim = 64
hidden_dim = emb_dim * ctx_len + 64
n_layers = 4 # (including the final layer)
n_params ~= 200M
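For reference, here’s a rough PyTorch sketch of how a model with this config could be wired up. It’s a reconstruction for illustration (the layer names and activation choice are mine), not the exact training code:

import torch.nn as nn

class CharMLP(nn.Module):
    def __init__(self, vocab_size, ctx_len=128, emb_dim=64, n_layers=4):
        super().__init__()
        hidden_dim = emb_dim * ctx_len + 64
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # n_layers linear layers in total, counting the final projection to the vocab
        layers = [nn.Linear(ctx_len * emb_dim, hidden_dim), nn.ReLU()]
        for _ in range(n_layers - 2):
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers.append(nn.Linear(hidden_dim, vocab_size))
        self.net = nn.Sequential(*layers)

    def forward(self, idx):              # idx: (batch, ctx_len) of character ids
        x = self.emb(idx).flatten(1)     # (batch, ctx_len * emb_dim)
        return self.net(x)               # logits over the vocab

With a ~100-character vocab this works out to roughly 200M parameters, nearly all of them in the hidden layers.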
And after ‘seeing’ about 5.1B characters (3 epochs on the new, larger dataset), we were able to get the validation perplexity down to 3.56, and the generation quality looks improved.
It’s worth noting, though, that this new ppl metric isn’t apples-to-apples with the old one… The shakespeare.txt set had 99 characters in its vocab, while the new, bigger set has 254, so the perplexities will differ. We’ll return to this for an accurate comparison later.
PROMPT: ' resolute determination. Dounia put implicit faith in his carrying out his plans and indeed she could not but believe in him. He'
OUTPUT: ' was remarked the last time he who would put the wonders which he did all the chill in passing and hurled with his hands with it.\n Now, as it were, inside that resting hooks On: entreating over the sa'
At this point we could just keep scaling the dataset and see what happens. But we’re already hitting a bit of a scaling wall, one that I haven’t mentioned up to now: time.
It takes about half an hour to train our model on ~50MB of text, and the model is only about 200M parameters. And this is with some nice tensor core utilization under the hood on an RTX 4090.
One option is to revisit our overall architecture. There’s good reason to be suspicious that the MLP structure is really inefficient, but we’ll return to that. For now, there’s an obvious optimization: make the text representation more efficient.
Right now our model ‘sees’ as input a one-hot vector of dimension ~100, which is fed into our learned embedding matrix. (Really, PyTorch just sees the index and doesn’t bother materializing the one-hot, but the input is still a member of a set with ~100 elements.)
From there, it’s squashed into an embedding of dimension 64. The theory behind this compression is that most of the characters are pretty rare. The core characters we need are the 26 lowercase letters of the alphabet plus another 26 uppercase, for 52 total; the special characters matter less for the average prediction and can probably fit in the rest of the embedding space.
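To make the index-vs-one-hot point concrete, here’s a tiny sketch (sizes from above, index chosen arbitrarily) showing that an embedding lookup is exactly a one-hot vector times the embedding matrix:

import torch
import torch.nn as nn

vocab_size, emb_dim = 100, 64
emb = nn.Embedding(vocab_size, emb_dim)

idx = torch.tensor([42])                                   # what PyTorch actually sees
one_hot = nn.functional.one_hot(idx, vocab_size).float()   # what that index stands for

# both paths select the same 64-dim row of the embedding matrix
assert torch.allclose(emb(idx), one_hot @ emb.weight)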
The problem with character-level input is that most characters are highly predictable given their neighbors — the model is spending a lot of its capacity on near-certain predictions like the ‘h’ after ‘t’. We’d rather the model spend its compute on harder, more meaningful decisions.
The key observation is that frequently occurring words like ‘the’, or subwords like ‘ly’ or ‘st’, could be added to the list of unique inputs. This would let the model think more in terms of units of meaning and would drastically compress the training set, at the cost of a higher-dimensional input representation.
In simplified (but runnable) Python, the idea looks like this:

from collections import Counter

def build_tokenizer(corpus, n_merges):
    # start from the base set of characters that actually appear in the corpus
    vocab = sorted(set(corpus))
    tokens = list(corpus)                          # the corpus as a list of current tokens
    rules = []
    for _ in range(n_merges):
        counts = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        pair = counts.most_common(1)[0][0]         # pick the most frequent pair
        rules.append(pair)
        vocab.append(pair[0] + pair[1])            # the merged pair becomes a new token
        tokens = apply_merge(tokens, pair)         # replace all occurrences
    return rules, vocab

def apply_merge(tokens, pair):
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def tokenize(text, rules):
    tokens = list(text)
    for pair in rules:                             # replay the learned merges, in order
        tokens = apply_merge(tokens, pair)
    return tokens
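Used on the Shakespeare file (path hypothetical), that would look something like:

corpus = open('shakespeare.txt').read()
rules, vocab = build_tokenizer(corpus, n_merges=1000)
print(tokenize('From where thou art, why should I haste me thence?', rules))

(The real tokenizer below also marks word boundaries with a ‘</w>’ token, which this simplified sketch skips.)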
I wrote my own tokenizer, which uses some more efficient data structures (but is by no means ‘fast’). Here’s what it looks like running for a thousand merges on shakespeare.txt (1 merge = 1 new token):
# NOTE: </w> = space
SHAKESPEARE - First 30 merges:
1: e + </w> -> e</w> (count: 2528)
2: e + </w> -> e</w> (count: 2528)
3: , + </w> -> ,</w> (count: 1827)
4: t + </w> -> t</w> (count: 1762)
5: s + </w> -> s</w> (count: 1597)
6: y + </w> -> y</w> (count: 1243)
7: i + n -> in (count: 1227)
8: d + </w> -> d</w> (count: 1188)
9: o + u -> ou (count: 963)
10: r + </w> -> r</w> (count: 927)
11: n + </w> -> n</w> (count: 735)
12: o + </w> -> o</w> (count: 735)
13: e + a -> ea (count: 653)
14: e + ,</w> -> e,</w> (count: 580)
15: e + r -> er (count: 559)
16: a + n -> an (count: 549)
17: l + l -> ll (count: 518)
18: f + </w> -> f</w> (count: 511)
19: h + a -> ha (count: 504)
20: o + r -> or (count: 448)
21: e + s -> es (count: 428)
22: th + </w> -> th</w> (count: 427)
23: in + g -> ing (count: 395)
24: e + n -> en (count: 394)
25: . + </w> -> .</w> (count: 382)
26: l + o -> lo (count: 375)
27: th + e -> the (count: 360)
28: th + e</w> -> the</w> (count: 355)
29: m + y</w> -> my</w> (count: 351)
30: i + s</w> -> is</w> (count: 347)
Final vocab size: 916
Most common tokens: [('</w>', 891), ('t</w>', 537), ('e</w>', 480), ('a', 465), ('I', 412), ('s</w>', 373), ('the</w>', 355), ('my</w>', 351), ('th</w>', 347), ('of</w>', 339)]
Encoding 'From where thou art, why should I haste me thence?':
Tokens: ['Fr', 'om</w>', 'where</w>', 'thou', '</w>', 'art,</w>', 'why</w>', 'shou', 'ld</w>', 'I', '</w>', 'ha', 'st', 'e</w>', 'me</w>', 'then', 'ce', '?</w>']
Compression: 50 chars -> 18 tokens
Timing full corpus encoding (100000 chars)...
Full corpus stats:
Input: 100000 chars
Output: 33195 tokens
Compression ratio: 3.01x
Time: 7.03s (14217 chars/sec)
The link to my tokenizer source is here if you’re interested.
The key thing to note is that we get a ‘compression’ ratio of 3x! In other words, the same number of ‘inputs’ to our network will convey 3x the information.
The downside of this compression is that now instead of inputs being dimension 100, they’re dimension 916 (for 1000 merges on a small dataset).
We don’t have to pay this price throughout the network; we only have to pay it when getting the tokenized input into our embedding layer, and from there the rest of the network just sees vectors of size embedding_dimension.
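A rough parameter count makes the point (sizes from the config and vocab above):

vocab_size, emb_dim = 916, 64
hidden_dim = 64 * 128 + 64                 # 8256, from the earlier config
emb_table = vocab_size * emb_dim           # ~59K parameters in the embedding table
one_hidden = hidden_dim * hidden_dim       # ~68M parameters per hidden layer
print(emb_table, one_hidden)               # the larger vocab is a rounding error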
Production language models use vocab sizes ~50,000, and embedding dimensions as small as 768.
I don’t think it’s obvious that you could ‘compress’ the input information down so much, but you don’t need one dimension per vocab entry. You just need enough dimensions that semantically distinct tokens end up in distinct regions, and a few hundred dimensions gives you an astronomical number of such regions.
First, I’ll note that I’m using the tokenizer from the ‘sentencepiece’ library, because I don’t trust my own tokenizer to be as efficient as I want.
Here are the settings (keeping the network architecture otherwise the same):
vocab_size = 2048
ctx_len = 32
batch_size = 4096
embedding_dim = 256
# OLD: ctx_len * embedding_dim = 128 * 64 = 8192
# NEW: ctx_len * embedding_dim = 32 * 256 = 8192
# ... account for ~4x compression in tokenization,
# ... so we keep the effective ctx size the same, but
# ... in theory we need more expressive embeddings now
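For reference, training and loading a SentencePiece model with this vocab size looks roughly like the following. The file names are placeholders, and model_type='bpe' is an assumption on my part (the library defaults to a unigram model):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='gutenberg_50mb.txt',   # hypothetical corpus path
    model_prefix='gutenberg_tok',
    vocab_size=2048,
    model_type='bpe',             # assumption; the default is 'unigram'
)

sp = spm.SentencePieceProcessor(model_file='gutenberg_tok.model')
print(sp.encode('From where thou art, why should I haste me thence?', out_type=str))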
Running the model with tokenized input on the scaled-up dataset yields this sample generation at 3 epochs in:
PROMPT: 'sprang up and ran towards the house, as if they were frightened at the sight of man; whilst two large dogs,'
OUTPUT: 'which was in the midst of the street, and it was not exactly as the other, and in the same manner as it were. The French had been solid as a little, in which they had a directly existed, and had a couple of yellow, and had not been so much as a sobbles, and had just seen her a great degree of giving him a functionar, for the last few days. She was in love with her, and she put on her tomorrow, and in her brother’s face. “If you wouldn’t come to see me?... Well, I’ll go and see him.... He’s a piece of good ⁇ by, and I’m going to go away.” “To the bottom, ain’t it, and I’m going away. I’ve been here with'
Additionally, it only took ~10 minutes to get to this point. That 3-4x compression ratio shows up almost linearly in the time to train; we’re able to chew through samples much more quickly. Or, more precisely, there are about 1/3 as many batches per epoch to compute.
The perplexity here is 66.6, which looks horrific, but again this isn’t apples-to-apples: earlier the model was predicting among 254 options, and now it’s choosing among 2048.
Final scaled-up run:
I did another training run over a 500MB dataset and we were able to get down to a validation perplexity of 25.5, with the following sample generations:
v3 | INFO | 2026-01-27 20:35:04 | Sample generations (temperature = 0.5):
PROMPT: 'of the sea-shore, or on some lofty eminence which overlooked the continents of Europe and Asi'
OUTPUT: 'a, had been the most famous in the campaign of 1812, and that the story of the Turkish cavalry had been purchased. After the first two years following the post of the battle of Holland. The Portuguese were the centre of the street, and there, on the other hand, was the fittest. The window was a large, bending over the fire, and the figure was snow-white, with a few steps, searched in the midst of the moon, which was to be seen. The camp is a piece of black cloth, which, however, was not a skeleton, the market of which the water was to be thick. The sun was shining; the sky was black and'
PROMPT: 'ia interromput la meva contemplació; però quan, enlluernat de tanta de ll'
OUTPUT: 'una, y els ulls, els plantados, que, como dejarse cuando aparecer, el cuadro de la fuerza de la cual se venden. El cinco, una cuya, de la cabecera, de la luz, suya, y de la masa, dejó de su esposo, de los cuatro, y de traspas, el cual, con el cual, uno de las cantidades, de la pena, uno de las arteras que el fundamento de la ciencia y de la lingua de Carretas, de ellas, de estas cuanto y no se encuentran en la materia'
PROMPT: '} and examine the goodness of the Ore in small quantities? 71. Whether, when they work in _great_,'
OUTPUT: 'is the divine man. What is it that we are in motion? Why do we not? What is the use of man? What is it? What is it that is? Is it that is not true? What is it? What is it that is? Where is it? What is it that is not what it is? What is it? Is it that it is? What is it? What is it that is not what it is? What is it that it is? What is it that is not to be? It is it that it is a mistake?—He will it be moved thee, which is the libation of a prince, is it not, as it were, the landscape, but of the whole, and the hexagonal of the other, there is no other means of contracting a new one. The Senator of the Comte de la'
I think this final run makes the limitations of our current architecture pretty clear. This dataset was absolutely not curated; I simply pulled random files out of the Project Gutenberg text-file dump, so the model is trying to learn several languages at once.
I get the impression that it’s just memorizing common phrases, since the pattern appears across multiple languages: generally correct words, locally plausible grammar, no real meaning. It’s funny, because people lodge this complaint against modern state-of-the-art language models (“they’re just stochastic parrots, they have no true understanding”, etc.), but in this case I think it’s pretty accurate. It looks like the model is just internally replicating a more complicated n-gram model. Of course this is unsubstantiated; we’re not using interpretability techniques to figure out what’s actually happening. But I think it’s clear we need to introduce a more effective inductive bias in the architecture. This is not going to scale. In the last training run the validation perplexity bottomed out at 25, but soon thereafter started climbing again.
The final big run had an ending train perplexity of 16.74 and a validation perplexity of 27.28.
Perplexity isn’t directly comparable across different vocab sizes. To normalize, we convert to bits-per-character (BPC):
\[\text{BPC} = \frac{\log_2(\text{ppl})}{\text{compression ratio}}\]
For char-level models, one prediction = one character, so compression ratio is 1. For our tokenized models, one prediction covers ~3.3 characters on average.
| Run | Val PPL | Compression (chars/token) | BPC |
|---|---|---|---|
| char, 5MB | 4.7 | 1 | 2.23 |
| char, 50MB | 3.56 | 1 | 1.83 |
| token, 50MB | 66.6 | 3.3 | 1.84 |
| token, 500MB | 25 | 3.3 | 1.41 |

(Using the best validation perplexity during training, not the final value.)
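A quick script to reproduce the BPC column from the other two:

import math

def bpc(ppl, chars_per_token):
    # bits per prediction = log2(ppl), spread over the characters each prediction covers
    return math.log2(ppl) / chars_per_token

print(round(bpc(4.7, 1.0), 2))    # 2.23  char, 5MB
print(round(bpc(3.56, 1.0), 2))   # 1.83  char, 50MB
print(round(bpc(66.6, 3.3), 2))   # 1.84  token, 50MB
print(round(bpc(25.0, 3.3), 2))   # 1.41  token, 500MB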
The tokenized 50MB and char 50MB runs are nearly identical (1.84 vs 1.83 BPC), which is a nice sanity check: same data, same model capacity, different encoding, same information-theoretic performance.
Also, at this point I think we have good justification to investigate other model architectures… Tokenization does give us a huge speedup in wall-clock training time, but there’s no meaningful gain in model performance; the MLP architecture is fundamentally unable to use the better encoding to improve prediction quality.