What's a Token?
A quick look at base-pair encoding.
When you’re using an LLM via an API, you’ll often see that you are billed by tokens. A token isn’t quite a word, OpenAI’s doc page describes them as:
That gives you some heuristics, but it doesn’t tell you how they come about. I’ll try and do that in this article.
What’s the goal of encoding text?
When we’re encoding text, we’re trying to be efficient. We want to encode as much data as possible in the least data possible, and we want to reproduce it out the other end efficiently.
Why not encode as characters?
You can encode text in many different ways. You might use ASCII where a character is represented by a single byte (only 255 choices), or something like UTF-8 which can represent over a million different values. Using a character level representation suffers from efficiency issues; the size of the data going “in” and “out” would be huge. We want to find something more efficient.
Why not encode as words?
What is a word? Not a deep philosophical question! English Words is a repository containing nearly 500K words, and if you’re building something to support many languages you might end up with a set of tokens in the millions. That trade-off may not be worth the simplicity.
Byte Pair Encoding
Here’s the byte pair encoding algorithm.
Start with a byte representation
If things occur together frequently merge them (e.g.
qu)Repeat step 2, until we’ve got the vocabulary size we want.
Let’s write some code to do that in the most naïve way possible (we’ll assume a character is synonymous with a byte). We’ll loop until no more matches arrive, but it’s easily changed to stop after a specific vocabulary size. I’ve missed off a couple of helper functions, but they do exactly what they say they do!
public static IEnumerable<List<string>> RunBPE(string text)
{
// We start with each character being a separate thing
var currentTokenization = text.Select(c => c.ToString()).ToList();
while (true)
{
var mostFrequentPair = FindMostFrequentPair(currentTokenization);
if (mostFrequentPair.frequency <= 1)
break;
// Update the tokenization by merging the most frequent pair
currentTokenization = MergePairs(currentTokenization, mostFrequentPair.pair);
yield return currentTokenization;
}
yield return currentTokenization;
}
Excuse the yield return , I just wanted an easy way of seeing how it was going. Let’s use aaabdaaabac as an example (ripped from Wikipedia).
a|a|a|b|d|a|a|a|b|a|c| # Start with a character for everything
aa|a|b|d|aa|a|b|a|c| # Create an "aa" token since it occurs twice
aaa|b|d|aaa|b|a|c| # Merge that to create an aaa token
aaab|d|aaab|a|c| # Merge that with the aaab
aaab|d|aaab|a|c| # Nothing left to merge, job done!
And that’s it. The algorithm is as simple as that.
Once you’ve got a list of tokens, give each of them a number and you’ve got your input to a large language model.
Trade-offs
There’s a whole bunch of trade-offs with base pair encoding. What should your desired vocabulary size be? Once you’ve got all the tokens, what if you find something that’s unencodable? You can visualize the trade-offs Open AI have made using Open AI’s Tokenizer.


