Hello everyone, today we are continuing our Zero to Hero series, and in particular we are going to reproduce the GPT-2 model, specifically the 124 million parameter version of it. When OpenAI released GPT-2 in 2019, they released it with a blog post, a paper, and code on GitHub (openai/gpt-2).
When we talk about reproducing GPT-2, we have to be careful: in this article we're reproducing the 124 million parameter model. The thing to realize is that these releases always come as a miniseries of models at different sizes, and usually the biggest one is what people mean by "the GPT-2".
We do this because you can put the model sizes on the x-axis of plots and on the y-axis you put downstream metrics that you're interested in like translation, summarization, question answering, and so on. This helps chart out scaling laws - as the model size increases, you get better performance on downstream metrics.
Understanding the GPT-2 Model Family
For GPT-2 specifically, there are four models in the miniseries:
- Starting at 124 million parameters
- Going all the way up to 1558 million parameters
The 124 million parameter model had:
- 12 layers in the Transformer
- 768 channels/dimensions in the Transformer
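To make those numbers concrete, here is a minimal configuration sketch in PyTorch-style Python. The field names (block_size, vocab_size, n_layer, n_head, n_embd) are a common convention rather than anything mandated by the release; the vocabulary and context-length values are the ones we'll see below when we inspect the weights.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length (context window)
    vocab_size: int = 50257  # number of tokens in the GPT-2 BPE vocabulary
    n_layer: int = 12        # number of Transformer blocks
    n_head: int = 12         # attention heads per block (12 for the 124M model)
    n_embd: int = 768        # embedding / channel dimension
```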
If we do everything correctly, by the end of this article we'll have a validation loss curve showing how well the model predicts the next token in a sequence on data it has never seen during training.
Why This Matters Today
When OpenAI was working on this five years ago, it was a fairly involved optimization run, and the GPUs and compute available were much less capable. Today, you can reproduce this model in roughly an hour or less, and it will cost you about $10 of cloud compute that you can rent.
One more thing to mention is that unlike many other models, OpenAI did release the weights for GPT-2, so those weights are all available in their repository. However, the GPT-2 paper isn't always detailed about the training process, so we'll also reference the GPT-3 paper which is more concrete about hyperparameters and optimization settings.
Getting Started: Loading the Original GPT-2 Model
The first thing I'd like to do is actually start at the end or at the target. Let's load the GPT-2 124M model as it was released by OpenAI and take it for a spin by sampling some tokens from it.
The issue is that when you go into the codebase of GPT-2 and look at the source code, you'll realize it uses TensorFlow. Since we'd prefer to use PyTorch (which is friendlier and easier to work with), we'll use the Hugging Face Transformers library which has done the work of converting the weights from TensorFlow to PyTorch.
When you pass "gpt2" as the name of the model to load, you actually get the 124 million parameter model. If you want the full GPT-2 (the 1.5 billion parameter version), you would use "gpt2-xl" instead.
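As a rough sketch, loading the pretrained weights through the Transformers library looks like this (the model identifiers are the ones published on the Hugging Face hub):

```python
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")        # the 124M model
# model = GPT2LMHeadModel.from_pretrained("gpt2-xl")   # the 1.5B model
```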
Exploring the Model Weights
When we get the state dictionary from the model, we can see the different parameters and their shapes:
The weight for token embedding is of size 50257 × 768
- We have 50,257 tokens in the GPT-2 vocabulary
- Each token has a 768-dimensional embedding vector
The position embeddings are of size 1024 × 768
- GPT-2 has a maximum sequence length of 1024
- Each position has a learned vector of 768 dimensions
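Here is a small sketch of how to print these shapes; the parameter names shown in the comments (transformer.wte.weight, transformer.wpe.weight) are the ones used by the Hugging Face GPT-2 implementation.

```python
sd = model.state_dict()
for name, tensor in sd.items():
    print(name, tuple(tensor.shape))

# expected, among others:
#   transformer.wte.weight -> (50257, 768)  token embeddings
#   transformer.wpe.weight -> (1024, 768)   position embeddings
```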
If we visualize these position embeddings, we can see they have structure. Every row represents a different absolute position, and the learned embeddings develop patterns that look somewhat like noisy sinusoids, which helps the model reason about the relative positions between tokens.
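For example, a quick way to look at them (assuming matplotlib is available and using the state-dictionary key from above):

```python
import matplotlib.pyplot as plt

wpe = sd["transformer.wpe.weight"].detach().cpu().numpy()  # (1024, 768)

plt.imshow(wpe, aspect="auto", cmap="viridis")
plt.xlabel("embedding dimension")
plt.ylabel("absolute position")
plt.title("GPT-2 learned position embeddings")
plt.show()

# individual columns look like noisy sinusoids
for col in (150, 200, 250):
    plt.plot(wpe[:, col], label=f"dim {col}")
plt.legend()
plt.show()
```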
Building Our Own GPT-2 Implementation
Now let's write our own GPT-2 class from scratch so we have full understanding of what's happening. We don't want to work with the Hugging Face implementation because it's too complicated - we want to build this ourselves.
The Transformer Architecture
Looking at the original "Attention is All You Need" paper, we need to understand that GPT-2 is slightly modified:
- GPT-2 is a decoder-only Transformer (the encoder part is missing)
- The cross-attention that used the encoder is also missing
- The layer normalizations are reshuffled: they are moved to the input of each sub-block (pre-normalization) rather than applied after it
- An additional layer normalization was added after the final self-attention block, just before the classifier
Implementing the Core Components
Let's now implement the whole skeleton of our fresh GPT-2 model:
The Main Container
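Here is a sketch of the top-level module. The sub-module names (wte, wpe, h, ln_f, lm_head) mirror the original GPT-2 checkpoint so the pretrained weights can later be mapped across; Block is defined in the next section.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(config.vocab_size, config.n_embd),  # token embeddings
            wpe=nn.Embedding(config.block_size, config.n_embd),  # position embeddings
            h=nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f=nn.LayerNorm(config.n_embd),                    # final layer norm
        ))
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)  # (B, T, n_embd)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)                                   # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
```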
The Transformer Block
The block implementation follows the pre-normalization pattern of GPT-2: layer norm is applied at the input of the attention and MLP sub-blocks, and the residual stream itself stays free of normalization, so gradients can flow from the output straight back to the input.
It's important to note that attention is a communication operation where tokens exchange information (a reduce operation), while the MLP processes each token individually (a map operation). The Transformer ends up being a repeated application of map-reduce.
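Here is a sketch of that block, continuing the skeleton above: pre-normalization, and plain additions on the residual stream.

```python
class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)   # defined below
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)                    # defined below

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # "reduce": tokens exchange information
        x = x + self.mlp(self.ln_2(x))   # "map": each token is processed independently
        return x
```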
The MLP Block
GPT-2 uses the approximate GELU (Gaussian Error Linear Unit) activation function. This is a historical quirk - at the time they developed this nonlinearity, the erf function needed for exact GELU was slow in TensorFlow, so they used an approximation. Today there's no good reason to use the approximate version, but we'll stick with it to reproduce GPT-2 exactly.
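A sketch of the MLP, using PyTorch's tanh-approximated GELU to mirror that historical choice (the layer names again follow the GPT-2 checkpoint):

```python
class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)    # expand 4x
        self.gelu = nn.GELU(approximate="tanh")                    # GPT-2's approximate GELU
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)  # project back down

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```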
The Attention Mechanism
The attention implementation is more complex, but here's the key part:
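Here is a sketch of multi-head causal self-attention, with the masked softmax written out explicitly (the Flash Attention section later replaces exactly this middle part):

```python
class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        assert config.n_embd % config.n_head == 0
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # q, k, v in one matmul
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)      # output projection
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # causal mask: each token may only attend to itself and the past
        self.register_buffer("bias", torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        qkv = self.c_attn(x)
        q, k, v = qkv.split(self.n_embd, dim=2)
        # reshape into (B, n_head, T, head_size)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # attention scores with causal masking
        att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)  # re-assemble the heads
        return self.c_proj(y)
```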
Loading the Weights & Testing Generation
Now that we have our GPT-2 implementation, let's load the released weights from Hugging Face into it (matching the parameter names and transposing the Conv1D-style weight matrices used by the original TensorFlow checkpoint) and make sure we can generate text that looks coherent.
After loading the weights, we can test generation:
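A minimal sampling loop, assuming the tiktoken GPT-2 tokenizer and simple top-k sampling with k = 50; it uses the GPT module sketched above, which returns (logits, loss).

```python
import tiktoken

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

enc = tiktoken.get_encoding("gpt2")
tokens = torch.tensor(enc.encode("Hello, I'm a language model,"), dtype=torch.long)
x = tokens.unsqueeze(0).repeat(5, 1).to(device)    # 5 parallel completions

torch.manual_seed(42)
while x.size(1) < 30:                              # generate up to 30 tokens
    with torch.no_grad():
        logits, _ = model(x)
        logits = logits[:, -1, :]                  # logits at the last position
        probs = F.softmax(logits, dim=-1)
        topk_probs, topk_idx = torch.topk(probs, 50, dim=-1)
        ix = torch.multinomial(topk_probs, 1)      # sample within the top 50
        xcol = torch.gather(topk_idx, -1, ix)
        x = torch.cat((x, xcol), dim=1)

for row in x:
    print(enc.decode(row.tolist()))
```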
Training Our Own GPT-2 From Scratch
Now that we've confirmed we can load the existing model, let's initialize our model from scratch with random weights and train it.
Data Preparation
For training, we'll start with the tiny Shakespeare dataset as a simple example:
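A sketch of a very simple loader for it, assuming the text file has been downloaded to input.txt and tokenizing it with tiktoken (the class name here is just illustrative):

```python
import tiktoken
import torch

class DataLoaderLite:
    def __init__(self, B, T):
        self.B, self.T = B, T
        with open("input.txt", "r") as f:
            text = f.read()
        enc = tiktoken.get_encoding("gpt2")
        self.tokens = torch.tensor(enc.encode(text), dtype=torch.long)
        self.pos = 0

    def next_batch(self):
        B, T = self.B, self.T
        buf = self.tokens[self.pos : self.pos + B * T + 1]
        x = buf[:-1].view(B, T)   # inputs
        y = buf[1:].view(B, T)    # targets, shifted by one token
        self.pos += B * T
        if self.pos + B * T + 1 > len(self.tokens):
            self.pos = 0          # wrap around to the start
        return x, y
```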
Model Initialization
We need to be careful with initialization to match GPT-2. The paper doesn't give all the details, but from the released code, weights are drawn from a normal distribution with a standard deviation of 0.02 (0.01 for the position embeddings in the original code) and biases are initialized to zero. Additionally, GPT-2 uses a special initialization for the residual paths: the projections that feed back into the residual stream are scaled down by 1/√N, where N is the number of residual additions, so the activations in the stream don't grow with depth.
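One way to implement both rules, as a sketch: weights from a normal with std 0.02, biases zeroed, and the residual-path projections scaled by 1/sqrt(2 * n_layer), since each block adds to the stream twice. The RESIDUAL_SCALE_INIT flag is just a marker used by this sketch, set on c_proj in the attention and MLP modules.

```python
# in CausalSelfAttention.__init__ and MLP.__init__, mark the residual projections:
#     self.c_proj.RESIDUAL_SCALE_INIT = True

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.config = config
        # ... build self.transformer and self.lm_head as before, then:
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            std = 0.02
            if getattr(module, "RESIDUAL_SCALE_INIT", False):
                # residual-path projections: scale by 1/sqrt(2 * n_layer)
                std *= (2 * self.config.n_layer) ** -0.5
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
```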
Weight Tying
An important detail: GPT-2 ties the weights of the token embedding and the final classifier:
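In code this is a single line in the constructor, so the embedding and the classifier share one parameter tensor:

```python
# inside GPT.__init__, after building self.transformer and self.lm_head:
self.transformer.wte.weight = self.lm_head.weight  # weight tying
```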
This saves about 30% of the parameters and tends to work better, as observed in the weight-tying literature (Press & Wolf, 2016).
Training Loop
Now let's set up our training loop:
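A minimal sketch, using AdamW as the optimizer (the betas and eps values here follow the GPT-3 paper's settings; the batch dimensions and step count are just illustrative):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"

model = GPT(GPTConfig())
model.to(device)

train_loader = DataLoaderLite(B=16, T=1024)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8)

for step in range(50):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    logits, loss = model(x, y)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")
```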
Optimizing Training Performance
To make training faster and more efficient, we can use several techniques:
1. Using Tensor Float 32 (TF32)
For A100 GPUs, we can enable TF32 precision:
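This is a one-line change; under TF32 the matrix multiplications keep float32 range but truncate the mantissa to roughly 10 bits inside the tensor cores.

```python
# enable TF32 matmuls on Ampere-class GPUs (A100)
torch.set_float32_matmul_precision("high")
```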
This gives about a 3x speedup with minimal accuracy impact.
2. Mixed Precision Training
We can use BFloat16 for even more speed:
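The standard way to do this is to wrap only the forward pass in an autocast context; the parameters, gradients, and optimizer state stay in float32.

```python
for step in range(50):
    x, y = train_loader.next_batch()
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits, loss = model(x, y)   # activations and matmuls run in bfloat16
    loss.backward()                  # parameter gradients remain float32
    optimizer.step()
```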
3. Torch Compile
PyTorch's compiler can significantly speed up training:
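Compilation is again essentially one line; the first iteration is slow while the model compiles, and subsequent steps are much faster.

```python
model = GPT(GPTConfig())
model.to(device)
model = torch.compile(model)  # kernel fusion + removes Python interpreter overhead
```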
This gives about a 2.3x improvement by reducing Python overhead and optimizing GPU operations.
4. Flash Attention
We can replace the standard attention implementation with Flash Attention:
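In the attention module this means swapping the explicit mask/softmax/matmul sequence for PyTorch's fused kernel, which dispatches to a Flash Attention implementation when it can.

```python
# inside CausalSelfAttention.forward, replacing the manual attention math:
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # fused, never materializes the T x T matrix
```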
This gives another ~27% improvement by optimizing memory access patterns.
5. Optimizing Dimensions
Making sure dimensions are "nice" numbers with many powers of two as factors can improve performance. In particular, the vocabulary size of 50257 is an ugly number, so we pad it up to 50304, which is divisible by 128. This gives about a 4% speedup despite doing slightly more computation, because the GPU kernels can stay on their well-tiled fast paths instead of handling a ragged remainder.
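The change itself, as a sketch:

```python
# 50304 = 128 * 393: divisible by 8, 16, 32, 64, 128
# the extra "phantom" tokens never occur in the data, so the model simply
# learns to drive their logits toward -inf, which is harmless
model = GPT(GPTConfig(vocab_size=50304))
```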
Scaling Up: Gradient Accumulation and Distributed Training
Gradient Accumulation
To simulate larger batch sizes than what fits in memory:
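A sketch, assuming a total batch of about half a million tokens per optimizer step (2**19 = 524,288, in the spirit of the GPT-3 table for this model size) built up from micro-batches that fit in memory:

```python
total_batch_size = 524288               # ~0.5M tokens per optimizer step (2**19)
B, T = 16, 1024                         # micro-batch that fits on the GPU
grad_accum_steps = total_batch_size // (B * T)

for step in range(50):                  # illustrative step count
    optimizer.zero_grad()
    for micro_step in range(grad_accum_steps):
        x, y = train_loader.next_batch()
        x, y = x.to(device), y.to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            logits, loss = model(x, y)
        # losses must be averaged over the whole big batch, not summed,
        # so scale each micro-batch loss before accumulating gradients
        loss = loss / grad_accum_steps
        loss.backward()                 # gradients accumulate across micro-steps
    optimizer.step()
```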
Distributed Data Parallel (DDP)
To use multiple GPUs:
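A sketch of the setup when launching with torchrun (for example torchrun --standalone --nproc_per_node=8 train.py); each process drives one GPU, and DDP averages the gradients across processes during the backward pass.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

ddp = int(os.environ.get("RANK", -1)) != -1   # are we running under torchrun?
if ddp:
    dist.init_process_group(backend="nccl")
    ddp_rank = int(os.environ["RANK"])
    ddp_local_rank = int(os.environ["LOCAL_RANK"])
    ddp_world_size = int(os.environ["WORLD_SIZE"])
    device = f"cuda:{ddp_local_rank}"
    torch.cuda.set_device(device)
else:
    ddp_rank, ddp_world_size = 0, 1
    device = "cuda"

model = GPT(GPTConfig(vocab_size=50304)).to(device)
if ddp:
    model = DDP(model, device_ids=[ddp_local_rank])

# ... training loop as before; each rank should also read a different
# shard of the data, e.g. by striding the data loader by ddp_world_size ...

if ddp:
    dist.destroy_process_group()
```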
Training on Real Data
For serious training, we need a larger dataset. We'll use FineWeb-Edu, a high-quality, education-filtered subset of web data:
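The dataset lives on the Hugging Face hub; here is a sketch of pulling its 10-billion-token sample with the datasets library and pre-tokenizing the documents (the repository and subset names used here are "HuggingFaceFW/fineweb-edu" and "sample-10BT").

```python
from datasets import load_dataset
import tiktoken

# ~10B GPT-2 tokens of high-quality, education-filtered web text
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train")

enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # the <|endoftext|> token id, used as a document delimiter

def tokenize(doc):
    # prepend <|endoftext|> so documents are cleanly separated in the token stream
    return [eot] + enc.encode_ordinary(doc["text"])
```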
Evaluation
To track our progress, we'll evaluate on:
- Validation loss on held-out data
- HellaSwag benchmark for common sense reasoning
- Sample quality through generation
Results
After training on 10 billion tokens of FineWeb-Edu data, our model surpasses the original GPT-2 124M on both validation loss and HellaSwag accuracy. Interestingly, we achieve this with roughly 10x fewer tokens than the original GPT-2 was trained on.
When training for 40 billion tokens (4 epochs), we approach the performance of the smallest GPT-3 model (125M parameters), which was trained on 300 billion tokens.
Conclusion
We've successfully reproduced the GPT-2 124M model from scratch, optimized the training process, and even surpassed the original model's performance. This demonstrates how far AI training has come in just a few years - what used to require significant resources can now be done relatively quickly and affordably.
Training your own model is no longer restricted to tech giants and research institutions. With the tools and resources available today, businesses and individuals can build custom models for their own needs. Whether you're fine-tuning existing models or training your own GPT from scratch, the process becomes more accessible every day; by understanding the fundamentals and following the steps outlined in this guide, you can build tools that drive innovation in your own field.