TL;DR: see the conclusion.
A few months ago, I was looking at huggingface's tokenizers library, and a line in the repo caught my eye: "Extremely fast... [Taking] less than 20 seconds to tokenize a GB of text on a server's CPU." The graph there showed that figure was measured with 32 threads. Their highest single-threaded performance was ~3 MB/s, with their competitor, tiktoken, getting ~6.5 MB/s single-threaded.
Instinctively, neither of those numbers felt fast to me at all. UTF-8 chars are 1-4 bytes, meaning they're processing only ~1 million chars per second. At that point I wasn't even aware of what tokenization libraries actually compute from that input, but it still felt slow, because modern processors are comically fast. A single core on a modern processor gets 2.5+ billion cycles per second, instructions can be queued such that one or more are completing on every cycle, and each of those instructions might be operating on 16/32/64 bytes at a time using SIMD. Those multipliers alone imply 30+ GB/s per core in the best case, compared to the 3-6 MB/s from the tokenization libraries. I know numbers like those only happen in completely optimal circumstances, on algorithms with very good cache performance and SIMD usage, but my curiosity was still itching: where had those missing 4 orders of magnitude gone?
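Back-of-envelope, with rough assumed numbers rather than anything measured, the gap looks like this:

```rust
fn main() {
    let cycles_per_sec = 2.5e9_f64; // one modern core
    let bytes_per_cycle = 16.0;     // one 16-byte SIMD instruction retired per cycle
    let best_case = cycles_per_sec * bytes_per_cycle; // = 40 GB/s
    let observed = 3.0e6_f64;       // ~3 MB/s single-threaded tokenization
    println!("best case: {:.0} GB/s", best_case / 1e9);
    println!("gap: ~{:.0}x", best_case / observed); // ~13,000x, i.e. ~4 orders of magnitude
}
```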
It should be noted that tokenization speed past 6.5 MB/s ~doesn't matter at all for money-affecting considerations like LLM API latency. Most requests that an LLM API gets are 1000 chars long at most, and ~1 KB of data at MB/s rates means <1 ms of latency from tokenization. There was a time when these tokenization libraries were written in pure Python and took 10+ ms for these kinds of requests, but that age is over; they're now written in Rust, and 1 ms is ~imperceptible to humans.
And besides, Andrej Karpathy has already proclaimed: "Eternal glory to anyone who can delete tokenization as a required step in LLMs". Some people are answering the call, so surely tokenization will soon™ be a thing of the past.
The reason I decided to investigate tokenization performance was purely to see if my intuition (that tokenization libraries were slow) was correct. Casey Muratori talks about how having a good sense for a piece of software's theoretical performance limit is an important skill for a programmer, as it undergirds a lot of the strategic decisions you end up making for that software in the future. I agree, and so I periodically want to check that any heuristics I have are correct to at least the same order of magnitude.
I'll be looking at the tiktoken library, a) because I only care about the inference performance they were talking about, not training, and b) because it's simple and easy to work with. I'll not be explaining what the tokenization algorithm is trying to do, but Andrej Karpathy's video on byte-pair encoding (BPE) walks through and motivates tiktoken's code in detail.
Let's look at an example of the algorithm in action. We'll be following along with the setup used in the huggingface benchmark we talked about above, which means we're using Llama 3.1 8b. I'll be ignoring special tokens because the benchmark does too.
Input:

- Text to tokenize: "Nice weather we're having, innit?"
- Regex pattern: "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
- Token vocabulary: " inn" -> 6301, ... "it" -> 275, ... (~128k more rows for Llama 3)

Algorithm:

| Step | Merge applied | Tokens |
| --- | --- | --- |
| Use the regex pattern to break the input into substrings. | | ["Nice", " weather", " we", "'re", " having", ",", " innit", "?"] |
| For each substring, we check if there's a token dedicated to the entire substring. Llama has 128k tokens, and so a lot of common words have their own token. | | "Nice" -> 46078, " weather" -> 9282, " we" -> 584, "'re" -> 2351, " having" -> 3515, "," -> 11, "?" -> 30 |
| But some don't. | | " innit" |
| In which case, we take the string's byte representation | | [32, 105, 110, 110, 105, 116] |
| And convert each of these to token values. | | [220, 72, 77, 77, 72, 83] |
| Then, we look at every adjacent pair of tokens, and if there's a token for the concatenation of the two strings that each of the pair of tokens represents, we make that replacement (called a "merge" in the BPE code). | 72, 77 -> 258 ("i", "n" -> "in") | [220, 258, 77, 72, 83] |
| | 72, 83 -> 275 ("i", "t" -> "it") | [220, 258, 77, 275] |
| You might notice that the resulting tokens are sorted from lowest to highest. That's because when multiple merges are possible, the algorithm requires that we prioritize the lowest resulting token value. | 220, 258 -> 304 (" ", "in" -> " in") | [304, 77, 275] |
| | 304, 77 -> 6301 (" in", "n" -> " inn") | |
| No more merges are possible after this. | | [6301, 275] |

Output:

[46078, 9282, 584, 2351, 3515, 11, 6301, 275, 30]
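To make the walkthrough concrete, here's a minimal sketch of the per-substring part of those steps in Rust. It's only meant to mirror the table above, not tiktoken's implementation: the `ranks` map (byte sequence -> token id) is a stand-in for the real ~128k-entry Llama 3 vocabulary, and where the table shows token ids at every step, this sketch keeps byte strings and only converts to ids at the end, which comes out the same.

```rust
use std::collections::HashMap;

/// Naive sketch: `piece` is one regex-split substring (e.g. " innit"),
/// `ranks` maps byte sequences to token ids.
fn encode_piece(piece: &[u8], ranks: &HashMap<Vec<u8>, u32>) -> Vec<u32> {
    // If the whole substring has a dedicated token, we're done.
    if let Some(&tok) = ranks.get(piece) {
        return vec![tok];
    }
    // Otherwise, start from one part per byte (every single byte has a token).
    let mut parts: Vec<Vec<u8>> = piece.iter().map(|&b| vec![b]).collect();
    // Repeatedly apply the merge that produces the lowest resulting token value.
    loop {
        let mut best: Option<(usize, u32)> = None; // (pair index, resulting token)
        for i in 0..parts.len().saturating_sub(1) {
            let merged = [parts[i].as_slice(), parts[i + 1].as_slice()].concat();
            if let Some(&tok) = ranks.get(&merged) {
                if best.map_or(true, |(_, t)| tok < t) {
                    best = Some((i, tok));
                }
            }
        }
        match best {
            Some((i, _)) => {
                let next = parts.remove(i + 1);
                parts[i].extend(next); // e.g. "i" + "n" -> "in"
            }
            None => break, // no more merges possible
        }
    }
    parts.iter().map(|p| ranks[p]).collect()
}
```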
The tiktoken code actually has a decent self-diagnosis for why their code is so slow. One of their (imo counterintuitive) conclusions is that the initial regex parsing of the input is the main performance bottleneck, not the actual BPE tokenization and merging.
Accepting that for a moment, that just pushes the question down one level. "Why is their tokenization slow?" becomes "why is their regex matching slow?", or maybe even "why are they using regex at all?". You really only need generic regex matching if you're still modifying the regex pattern to experiment with which way of splitting the input makes your LLM perform best. But once the LLM's creator has finished experimenting, the regex pattern gets frozen in place; they could easily hardcode the behavior of their chosen pattern. And besides, the LLM creator is allowed to use something other than regex to break up the initial input in the first place.
And so, I wondered: how much faster would a hardcoded version of an LLM regex pattern run?
I just went with the Llama 3 regex pattern that was used in the benchmark, cuz it seemed roughly as complicated as the rest of them:
"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
This bad boy is 7 match clauses ORed together. Note that regex alternation short-circuits: for each clause, in order, the engine tries to find the longest match it can, and as soon as any clause matches, it doesn't even try the remaining ones. Here's what each clause does (I'll sketch what hardcoding one of them looks like right after the breakdown):
"(?i:'s|'t|'re|'ve|'m|'ll|'d)" | Case-insensitive English contractions. |
"[^\r\n\p{L}\p{N}]?\p{L}+" | 1+ letters, probably to match words, optionally preceded by a non-(newline, letter, or number) character. |
"\p{N}{1,3}" | Chunks of 1-3 numbers. |
" ?[^\s\p{L}\p{N}]+[\r\n]*" | 1+ non-(whitespace, letter, or number) characters, probably to match sequences of symbols, optionally preceded by a space. |
"\s*[\r\n]+" | Newlines, optionally preceded by whitespace. |
"\s+(?!\S)" | Whitespace, except we either leave one whitespace unmatched or hit the end of the string. |
"\s+" | Final catch-all whitespace. |
Hardcoding this is pretty straightforward. Here are some small performance considerations that I learned:
But by far the biggest intuition break I had as I hardcoded this regex came when I learned that Rust's standard functions for determining whether Unicode characters are numbers/letters/whitespace (char::is_alphabetic(), char::is_numeric(), and char::is_whitespace()) do not match what fancy_regex (and most regex libraries) use.
How are they different? Well, for example, fancy_regex sees if something's a letter by checking this unicode table. It's a sorted array of ~700 character ranges, generated by ucd-generate, that you can either scan through linearly or binary search.
char::is_alphabetic() does something similar (if with a more optimized form) here, which calls this function using these hardcoded arguments. The main thing to note is just that these two definitions are in different places and, more importantly, not equivalent.
Why are they different? Only the heavens know; considering both of these libraries are managed by the Rust Foundation, you'd think they would agree. I know that regex specifications for classes of characters like "letter" aren't specific enough to pin down exact unicode character ranges, so it's not that. All I can say is that the ucd-generate unicode tables make for better matches, for example by consistently grouping diacritics with the letters they're modifying.
Regardless, I had to match the behavior of this regex library + pattern combo as well as I could, while still eking out as much performance as possible. Binary search works best, even if you use branchless SIMD for the linear scan. And yes, I know that further optimizations can be done to massively speed up bulk binary search operations, especially on static data like these unicode tables... But it turns out that Rust's binary search is actually pretty decent (branchless loop via predication/conditional moves, etc.), and I'm pretty sure the is_letter/is_number/is_whitespace checks weren't really the bottleneck after some of the optimizations I mentioned above. If I have more time, I'll come back to implement those.
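For reference, the range-table lookup itself is tiny. Here's a minimal sketch, with a made-up three-entry table standing in for the real ~700-entry one (the exact shape of ucd-generate's output may differ slightly from this):

```rust
use std::cmp::Ordering;

// Made-up miniature table; the real "letter" table has ~700 sorted,
// non-overlapping, inclusive (start, end) ranges.
const LETTER_RANGES: &[(char, char)] = &[
    ('A', 'Z'),
    ('a', 'z'),
    ('\u{00C0}', '\u{00D6}'),
];

fn is_letter(c: char) -> bool {
    LETTER_RANGES
        .binary_search_by(|&(start, end)| {
            if end < c {
                Ordering::Less // this range sits entirely before c
            } else if start > c {
                Ordering::Greater // this range sits entirely after c
            } else {
                Ordering::Equal // c falls inside this range
            }
        })
        .is_ok()
}
```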
After matching the behavior of the regex exactly*, I still managed to have a substantially faster system, which was expected considering my hardcoded version of one specific pattern does far fewer things than a fully compliant regex library. Overall, it gets ~5x the regex speed on <100 char inputs and 10-20x on 1000+ char inputs.
*I also learned through my test cases (cases 47 and 48, which fail) that fancy_regex also considers an escaped '\n' (as in the two-character string "\\n") to be whitespace. Why? I have absolutely no clue, but personally, I think fancy_regex is just straight-up parsing incorrectly here, and so I did not match their behavior.
I dropped this hardcoded version of the regex pattern into the benchmark and only got a ~2.5x speedup, for Amdahl's law reasons: the bottlenecking constraint just shifted from the regex matching to another part of the tokenization function. A bit of testing quickly showed that it was the core BPE merging that had become the new bottleneck.
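To put rough, illustrative numbers on the Amdahl's law point (the split is my assumption for the sake of the arithmetic, not a measurement): if regex matching was about 65% of the original runtime and the hardcoded version makes that part ~10x faster, the overall speedup caps out around

$$ \frac{1}{(1 - 0.65) + 0.65/10} \approx 2.4\times $$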
This is the code for the merging, which runs on each substring. One thing that surprised me here was that a lot of performance work had already gone into making this code fast:
Honestly, the tiktoken guys did a pretty decent job here.
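I won't reproduce their code here, but to give a flavor of the kind of work that went in: instead of re-concatenating byte strings on every pass like my naive sketch above, the merge can be formulated over a list of (byte offset, rank of merging this part with the next one), so each merge only has to re-rank its two neighbours. Here's a simplified sketch of that kind of structure; it's mine, not tiktoken's actual source.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Sketch of a rank-tracking byte-pair merge. `parts[i]` holds the byte offset
/// where part i starts, plus the token id you'd get by merging part i with
/// part i+1 (Rank::MAX if that pair isn't in the vocabulary).
/// Assumes `piece` is non-empty and every single byte has a token.
fn byte_pair_merge_sketch(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>) -> Vec<Rank> {
    let get = |start: usize, end: usize| -> Rank {
        ranks.get(&piece[start..end]).copied().unwrap_or(Rank::MAX)
    };

    // One part per byte, plus an end sentinel.
    let mut parts: Vec<(usize, Rank)> = Vec::with_capacity(piece.len() + 1);
    for i in 0..piece.len().saturating_sub(1) {
        parts.push((i, get(i, i + 2)));
    }
    parts.push((piece.len().saturating_sub(1), Rank::MAX));
    parts.push((piece.len(), Rank::MAX));

    loop {
        // Find the leftmost lowest-ranked adjacent pair still available.
        let mut i = 0;
        let mut min_rank = Rank::MAX;
        for (j, &(_, r)) in parts.iter().enumerate() {
            if r < min_rank {
                min_rank = r;
                i = j;
            }
        }
        if min_rank == Rank::MAX {
            break; // no more merges possible
        }
        // Merging parts i and i+1 only changes the ranks stored at i-1 and i.
        parts[i].1 = if i + 3 < parts.len() { get(parts[i].0, parts[i + 3].0) } else { Rank::MAX };
        if i > 0 {
            parts[i - 1].1 = if i + 2 < parts.len() { get(parts[i - 1].0, parts[i + 2].0) } else { Rank::MAX };
        }
        parts.remove(i + 1);
    }

    // Every remaining part corresponds to a token in the vocabulary.
    parts.windows(2).map(|w| ranks[&piece[w[0].0..w[1].0]]).collect()
}
```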
At the end of this exploration, the tokenization rate feels somewhat more reasonable. Here are the things that my initial intuition had either missed or been wrong about:
So I guess my answer to the title question, "why is LLM tokenization so slow?", is ~"it's not as slow as I thought it was". Hmm.
While I still think it's possible to easily get a solid 10-20x total performance improvement on tokenization from their baseline should LLM API providers ever think it's important, I now feel like I understand why more than that is hard/unreasonable. This might be the first time I've gone through a seemingly extremely unperformant codebase and come back out thinking "huh, they didn't actually do too bad".