Hi everyone, today we are going to look at tokenization in Large Language Models (LLMs). Sadly, tokenization is a relatively complex and gnarly component of state-of-the-art LLMs, but it is necessary to understand it in some detail, because many of the odd behaviors of LLMs that may be attributed to the neural network, or that otherwise appear mysterious, actually trace back to tokenization.
So what is tokenization? Well, it turns out that in our previous video, Let's build GPT from scratch, we already covered tokenization, but it was only a very simple, naive, character-level version of it. If you go to the Google Colab for that video, you'll see that we started with our training data (Shakespeare), which is just a large string in Python.
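Roughly, that setup looked like this (a sketch; it assumes the Tiny Shakespeare text has been downloaded locally as input.txt, the filename used in that video):

# read the Tiny Shakespeare dataset into a single Python string
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()

print(len(text))   # number of characters in the dataset
print(text[:100])  # peek at the first 100 characters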
But how do we feed strings into a language model? Well, we saw that we did this by first constructing a vocabulary of all the possible characters we found in the entire training set:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(vocab_size)
print(''.join(chars))
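From that vocabulary we then built a simple character-level tokenizer: a lookup table from characters to integers and back, plus encode and decode functions. A minimal sketch of that idea (the variable names stoi/itos follow that video):

# create a mapping from characters to integers and vice versa
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    # encoder: take a string, output a list of integers
    return [stoi[c] for c in s]

def decode(ids):
    # decoder: take a list of integers, output a string
    return ''.join(itos[i] for i in ids)

print(encode("hii there"))
print(decode(encode("hii there")))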
In practice, state-of-the-art language models use much more complicated schemes for constructing their token vocabularies. They operate on the level of character chunks rather than individual characters, and these chunks are constructed using algorithms such as byte pair encoding (BPE), which we will cover in detail in this video.
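As a small preview, the core idea of BPE is simple: repeatedly find the most frequent pair of adjacent tokens in the data and replace every occurrence of that pair with a newly minted token. Here is a rough sketch of a single merge step (the helper names and the toy string are purely illustrative, not the exact implementation we will build later):

def get_pair_counts(ids):
    # count how often each adjacent pair of tokens occurs
    counts = {}
    for a, b in zip(ids, ids[1:]):
        counts[(a, b)] = counts.get((a, b), 0) + 1
    return counts

def merge(ids, pair, new_id):
    # replace every occurrence of `pair` in `ids` with the token `new_id`
    out = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from raw bytes (integers 0..255)
counts = get_pair_counts(ids)
top_pair = max(counts, key=counts.get)     # most frequent adjacent pair
ids = merge(ids, top_pair, 256)            # mint token id 256 for the merged pair
print(top_pair, ids)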
The GPT-2 paper introduced byte-level byte pair encoding (BPE) as a mechanism for tokenization in the context of large language models:
Tokens are the fundamental unit of large language models: everything the model operates on is a sequence of tokens, and tokenization is the process of translating strings of text into sequences of tokens and vice versa.
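To make this concrete, here is what a round trip through the GPT-2 tokenizer looks like using the tiktoken library (used here purely as an illustration; the input string is arbitrary):

import tiktoken

# load the GPT-2 byte-level BPE tokenizer
enc = tiktoken.get_encoding("gpt2")

text = "Hello world, this is tokenization!"
tokens = enc.encode(text)   # string -> list of integer token ids
print(tokens)
print(enc.decode(tokens))   # token ids -> string (round-trips back to the original)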
Tokenization is at the heart of much weirdness in LLMs. Many issues that may appear to be problems with the neural network architecture or with the large language model itself are actually issues with tokenization. For example: