Tuesday, June 11, 2024

How to Make Your Own AI Model: A Complete Guide

Hello everyone. Today we are continuing our Zero to Hero series, and in particular we are going to reproduce the GPT-2 model, the 124 million parameter version of it. When OpenAI released GPT-2 in 2019, they released it with a blog post, a paper, and code on GitHub (openai/gpt-2).

When we talk about reproducing GPT-2, we have to be careful because in this article we're going to be reproducing the 124 million parameter model. The thing to realize is that there's always a miniseries when these releases are made. The GPT-2 miniseries is made up of models at different sizes, and usually the biggest model is called "the GPT-2".

Releasing a miniseries is useful because you can put model size on the x-axis of a plot and a downstream metric you care about (translation, summarization, question answering, and so on) on the y-axis. This charts out scaling laws: as model size increases, performance on downstream metrics improves.

Understanding the GPT-2 Model Family

For GPT-2 specifically, there are four models in the miniseries:

  • Starting at 124 million parameters
  • Going all the way up to 1558 million parameters

The 124 million parameter model had:

  • 12 layers in the Transformer
  • 768 channels/dimensions in the Transformer

If we do everything correctly, by the end of this article we will have validation loss curves showing how well the model predicts the next token in a sequence on validation data it has never seen during training.

Why This Matters Today

When OpenAI trained this model five years ago, it was a fairly involved optimization, and the GPUs and compute available were much smaller. Today, you can reproduce this model in roughly an hour or less, and it will cost you about $10 of rented cloud compute.

One more thing to mention is that unlike many other models, OpenAI did release the weights for GPT-2, so those weights are all available in their repository. However, the GPT-2 paper isn't always detailed about the training process, so we'll also reference the GPT-3 paper which is more concrete about hyperparameters and optimization settings.

Getting Started: Loading the Original GPT-2 Model

The first thing I'd like to do is actually start at the end or at the target. Let's load the GPT-2 124M model as it was released by OpenAI and take it for a spin by sampling some tokens from it.

The issue is that when you go into the codebase of GPT-2 and look at the source code, you'll realize it uses TensorFlow. Since we'd prefer to use PyTorch (which is friendlier and easier to work with), we'll use the Hugging Face Transformers library which has done the work of converting the weights from TensorFlow to PyTorch.

from transformers import GPT2LMHeadModel
model = GPT2LMHeadModel.from_pretrained("gpt2")
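
To take it for a spin right away, we can sample a few completions with the Transformers text-generation pipeline (a quick sketch; the prompt and seed are arbitrary):

from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)
print(generator("Hello, I'm a language model,", max_length=30, num_return_sequences=3))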

Passing "gpt2" as the model name loads the 124 million parameter model. If you want the actual GPT-2 (the 1.5 billion parameter version), you would use "gpt2-xl".

Exploring the Model Weights

When we get the state dictionary from the model, we can see the different parameters and their shapes (a short inspection sketch follows this list):

  • The weight for token embedding is of size 50257 × 768

    • We have 50,257 tokens in the GPT-2 vocabulary
    • Each token has a 768-dimensional embedding vector
  • The position embeddings are of size 1024 × 768

    • GPT-2 has a maximum sequence length of 1024
    • Each position has a learned vector of 768 dimensions
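
If you want to check these shapes yourself, a quick way (just a sketch, using the model object loaded above) is to iterate over the state dict:

sd_hf = model.state_dict()
for name, tensor in sd_hf.items():
    print(name, tuple(tensor.shape))

# Two of the entries we care about:
# transformer.wte.weight -> (50257, 768)   token embeddings
# transformer.wpe.weight -> (1024, 768)    position embeddings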

If we visualize these position embeddings, we can see they have structure. Every row represents a different absolute position, and the learned embeddings develop patterns that look somewhat like sinusoids, which helps the model reason about the relative positions of tokens.



Building Our Own GPT-2 Implementation

Now let's write our own GPT-2 class from scratch so we have full understanding of what's happening. We don't want to work with the Hugging Face implementation because it's too complicated - we want to build this ourselves.

The Transformer Architecture

Looking at the original "Attention is All You Need" paper, we need to understand that GPT-2 is slightly modified:

  1. GPT-2 is a decoder-only Transformer (the encoder part is missing)
  2. The cross-attention that used the encoder is also missing
  3. The layer normalizations are reshuffled: they come before the attention and MLP blocks rather than after
  4. An additional layer normalization was added after the final self-attention block

Implementing the Core Components

Let's now implement the whole skeleton of our fresh GPT-2 model:

The Main Container

import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.transformer = nn.ModuleDict({
            'wte': nn.Embedding(config.vocab_size, config.n_embd),  # token embeddings
            'wpe': nn.Embedding(config.block_size, config.n_embd),  # position embeddings
            'h': nn.ModuleList([Block(config) for _ in range(config.n_layer)]),  # transformer blocks
            'ln_f': nn.LayerNorm(config.n_embd),  # final layer norm
        })
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
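
For reference, the config object used above can be a plain dataclass, and the forward pass ties everything together. Here is a minimal sketch; the GPTConfig field names and the convention of returning (logits, loss) are our own choices, picked to match how we call the model later in this article:

from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class GPTConfig:
    block_size: int = 1024   # maximum sequence length
    vocab_size: int = 50257  # number of tokens in the GPT-2 BPE vocabulary
    n_layer: int = 12        # number of transformer blocks
    n_head: int = 12         # number of attention heads
    n_embd: int = 768        # embedding dimension

class GPT(nn.Module):
    # ... __init__ as above ...

    def forward(self, idx, targets=None):
        B, T = idx.size()
        pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)  # token + position embeddings
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)  # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss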

The Transformer Block

The block implementation follows the pre-normalization pattern used in GPT-2:

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = MLP(config)

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))
        return x

It's important to note that attention is a communication operation where tokens exchange information (a reduce operation), while the MLP processes each token individually (a map operation). The Transformer ends up being a repeated application of map-reduce.

The MLP Block

class MLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
        self.gelu = nn.GELU(approximate='tanh')
        self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)

    def forward(self, x):
        x = self.c_fc(x)
        x = self.gelu(x)
        x = self.c_proj(x)
        return x

GPT-2 uses the approximate GELU (Gaussian Error Linear Unit) activation function. This is a historical quirk - at the time they developed this nonlinearity, the erf function needed for exact GELU was slow in TensorFlow, so they used an approximation. Today there's no good reason to use the approximate version, but we'll stick with it to reproduce GPT-2 exactly.
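
As a quick sanity check of how close the two versions are, we can compare PyTorch's exact and tanh-approximate GELU on a few values (a small sketch; the inputs are arbitrary):

import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
gelu_exact = nn.GELU()                   # erf-based definition
gelu_tanh = nn.GELU(approximate='tanh')  # the approximation GPT-2 used
print(torch.max(torch.abs(gelu_exact(x) - gelu_tanh(x))))  # prints a very small number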

The Attention Mechanism

The attention implementation is more complex, but here's the key part:

import math
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)  # combined q, k, v projection
        self.c_proj = nn.Linear(config.n_embd, config.n_embd)
        self.n_head = config.n_head
        self.n_embd = config.n_embd
        # Causal mask so each token only attends to positions at or before its own
        self.register_buffer('mask', torch.tril(torch.ones(config.block_size, config.block_size))
                                          .view(1, 1, config.block_size, config.block_size))

    def forward(self, x):
        B, T, C = x.size()
        # Calculate query, key, values for all heads
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        # Reshape for multi-head attention: (B, n_head, T, head_size)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # Causal self-attention
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v
        # Reassemble all head outputs side by side
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        # Output projection
        y = self.c_proj(y)
        return y

Loading the Weights & Testing Generation

Now that we have our GPT-2 implementation, let's load the weights from Hugging Face and make sure we can generate text that looks coherent:

# A classmethod on the GPT class that builds our model and fills it with OpenAI's weights
@classmethod
def from_pretrained(cls, model_type):
    # Create our model with a matching configuration ("gpt2" is the 124M model)
    config_args = {
        'gpt2':    dict(n_layer=12, n_head=12, n_embd=768),
        'gpt2-xl': dict(n_layer=48, n_head=25, n_embd=1600),
    }[model_type]
    config = GPTConfig(vocab_size=50257, block_size=1024, **config_args)
    model = cls(config)
    sd = model.state_dict()
    # Load weights from Hugging Face
    sd_hf = GPT2LMHeadModel.from_pretrained(model_type).state_dict()
    # The original checkpoint uses Conv1D modules, so these weights are stored transposed
    transposed = ('attn.c_attn.weight', 'attn.c_proj.weight', 'mlp.c_fc.weight', 'mlp.c_proj.weight')
    # Copy weights to our model
    for k in sd_hf:
        if k not in sd:
            continue  # skip buffers that only exist on the Hugging Face side
        with torch.no_grad():
            if any(k.endswith(t) for t in transposed):
                sd[k].copy_(sd_hf[k].t())
            else:
                sd[k].copy_(sd_hf[k])
    return model

After loading the weights, we can test generation:

import tiktoken

model = GPT.from_pretrained("gpt2")
model.eval()
model.to('cuda')

# Create input tokens with the GPT-2 BPE tokenizer
tokenizer = tiktoken.get_encoding("gpt2")
prefix = "Hello, I'm a language model,"
tokens = tokenizer.encode(prefix)
x = torch.tensor([tokens]).to('cuda')

# Generate 30 tokens by repeatedly sampling from the next-token distribution
with torch.no_grad():
    for _ in range(30):
        logits, _ = model(x)        # (B, T, vocab_size)
        logits = logits[:, -1, :]   # keep only the last position
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        x = torch.cat((x, next_token), dim=1)

output = tokenizer.decode(x[0].tolist())
print(output)

Training Our Own GPT-2 From Scratch

Now that we've confirmed we can load the existing model, let's initialize our model from scratch with random weights and train it.

Data Preparation

For training, we'll start with the tiny Shakespeare dataset as a simple example:

# Load the dataset
with open('input.txt', 'r') as f:
    data = f.read()

# Tokenize with the GPT-2 BPE tokenizer and split into train/val
tokens = torch.tensor(tokenizer.encode(data), dtype=torch.long)
n = int(0.9 * len(tokens))
train_data, val_data = tokens[:n], tokens[n:]

# Create batches of (input, target) pairs, where targets are inputs shifted by one token
def get_batch(split):
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

Model Initialization

We need to be careful with initialization to match GPT-2. The paper doesn't give all details, but from the code:

# Applied recursively to every submodule via model.apply(_init_weights)
def _init_weights(module):
    if isinstance(module, nn.Linear):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

Additionally, GPT-2 uses a special initialization for residual paths:

# Scale down the initial weights of the residual projections by 1/sqrt(n),
# where n is the number of residual additions along the stream
n_residual = config.n_layer * 2       # each block contributes 2 residual connections
scale = 1.0 / math.sqrt(n_residual)   # applied to the c_proj weights; see the sketch below
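
One way to wire this in is a nanoGPT-style approach: tag the residual projection layers when the model is built and shrink their std inside _init_weights. This is only a sketch; the RESIDUAL_SCALE_INIT flag name and the use of a module-level config variable are our own choices:

# In GPT.__init__, tag the projections that write into the residual stream:
for block in self.transformer.h:
    block.attn.c_proj.RESIDUAL_SCALE_INIT = True
    block.mlp.c_proj.RESIDUAL_SCALE_INIT = True

# Extended _init_weights that shrinks the std of the tagged layers:
def _init_weights(module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if getattr(module, 'RESIDUAL_SCALE_INIT', False):
            std *= (2 * config.n_layer) ** -0.5   # 1/sqrt(number of residual additions)
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)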

Weight Tying

An important detail: GPT-2 ties the weights of the token embedding and the final classifier:

# Weight tying: make the token embedding and LM head share weights
self.transformer.wte.weight = self.lm_head.weight

This saves about 30% of the parameters (the tied matrix is 50,257 × 768 ≈ 38.6M weights out of roughly 124M total) and tends to work better according to research.

Training Loop

Now let's set up our training loop:

# Create model
model = GPT(config)
model.to(device)

# Optimizer (AdamW; the betas follow the GPT-3 paper)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95))

# Training loop
for step in range(max_steps):
    # Get batch
    x, y = get_batch('train')
    # Forward pass
    logits, loss = model(x, y)
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip the global gradient norm to 1.0
    optimizer.step()
    # Print progress
    if step % 100 == 0:
        print(f"Step {step}: Loss {loss.item():.4f}")

Optimizing Training Performance

To make training faster and more efficient, we can use several techniques:

1. Using Tensor Float 32 (TF32)

For A100 GPUs, we can enable TF32 precision:

torch.backends.cuda.matmul.allow_tf32 = True

This gives about a 3x speedup with minimal accuracy impact.
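
An equivalent way to request TF32 matmuls in recent PyTorch versions is the matmul precision knob; either form works:

# "high" allows TF32 for float32 matrix multiplications ("highest" keeps full FP32)
torch.set_float32_matmul_precision('high')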

2. Mixed Precision Training

We can use BFloat16 for even more speed:

# bfloat16 autocast: activations run in BF16 while the parameters stay in FP32
with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
    logits, loss = model(x, y)

3. Torch Compile

PyTorch's compiler can significantly speed up training:

model = torch.compile(model)

This gives about a 2.3x improvement by reducing Python overhead and optimizing GPU operations.

4. Flash Attention

We can replace the standard attention implementation with Flash Attention:

# Replace these lines in the attention forward method
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(self.mask[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y = att @ v

# With this single line
y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

This gives another ~27% improvement: Flash Attention fuses the whole attention computation into a single kernel and never materializes the full T×T attention matrix in GPU memory. It also makes the explicit mask buffer unnecessary, since is_causal=True handles the masking.

5. Optimizing Dimensions

Making sure key dimensions have many powers of two as factors (for example, multiples of 64 or 128) can improve performance:

# Change vocab_size from 50257 to 50304
config.vocab_size = 50304 # Next multiple of 128

This gives about a 4% improvement despite doing slightly more computation: the extra vocabulary rows are simply never used, and the padded size maps more evenly onto the GPU kernels.

Scaling Up: Gradient Accumulation and Distributed Training

Gradient Accumulation

To simulate larger batch sizes than what fits in memory:

# Set up gradient accumulation
micro_batch_size = 16
total_batch_size = 524288  # 2**19 tokens per optimizer step
grad_accum_steps = total_batch_size // (micro_batch_size * block_size)

# Training loop with gradient accumulation
for step in range(max_steps):
    loss_accum = 0.0
    optimizer.zero_grad()
    for micro_step in range(grad_accum_steps):
        x, y = get_batch('train')
        with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
            logits, loss = model(x, y)
        loss = loss / grad_accum_steps  # scale so the accumulated gradient is a mean, not a sum
        loss.backward()                 # gradients add up across micro-steps
        loss_accum += loss.detach()
    # Clip gradients and update weights once per full batch
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

Distributed Data Parallel (DDP)

To use multiple GPUs:

import os
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize DDP (torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process)
dist.init_process_group(backend='nccl')
rank = int(os.environ['LOCAL_RANK'])   # which GPU this process drives (single-node setup)
world_size = dist.get_world_size()
torch.cuda.set_device(rank)
model.to(rank)
model = DDP(model, device_ids=[rank])

# Adjust batch sizes so the total batch is split across processes
total_batch_size = 524288  # 2**19 tokens
tokens_per_gpu = micro_batch_size * block_size
grad_accum_steps = total_batch_size // (tokens_per_gpu * world_size)
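
With DDP the script is launched via torchrun, which starts one process per GPU and sets the rank environment variables used above. For example, on an 8-GPU node (the script name here is just a placeholder):

torchrun --standalone --nproc_per_node=8 train_gpt2.py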

Training on Real Data

For serious training, we need a larger dataset. We'll use FineWeb-Edu, a high-quality educational subset of the FineWeb web-crawl dataset:

import glob
import os
import numpy as np

# Process the data into shards of uint16 token ids (the GPT-2 vocab fits in 16 bits)
def tokenize_document(doc):
    # encode_ordinary ignores any special-token strings that appear in raw web text
    tokens = [tokenizer.eot_token] + tokenizer.encode_ordinary(doc['text'])
    return np.array(tokens, dtype=np.uint16)

# Create a data loader that streams batches out of the saved .npy shards
class DataLoader:
    def __init__(self, data_dir, split='train', rank=0, world_size=1):
        # Adjust the glob pattern to however the shard files were named
        self.shards = sorted(glob.glob(os.path.join(data_dir, f'*{split}*.npy')))
        self.rank = rank
        self.world_size = world_size
        self.current_shard = 0
        self.tokens = np.load(self.shards[self.current_shard])
        self.position = None  # set on the first call, once the batch shape is known

    def next_batch(self, batch_size, block_size):
        n = batch_size * block_size
        if self.position is None:
            self.position = n * self.rank  # each process starts at its own offset
        # Load tokens from the current shard and build inputs/targets shifted by one
        buf = torch.from_numpy(self.tokens[self.position:self.position + n + 1].astype(np.int64))
        x = buf[:-1].view(batch_size, block_size)
        y = buf[1:].view(batch_size, block_size)
        # Advance the position (processes stride through the shard interleaved)
        # and move on to the next shard at the boundary
        self.position += n * self.world_size
        if self.position + n + 1 > len(self.tokens):
            self.current_shard = (self.current_shard + 1) % len(self.shards)
            self.tokens = np.load(self.shards[self.current_shard])
            self.position = n * self.rank
        return x, y
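
Usage might look like this, assuming the shards were written to a local directory (the directory name below is just an example; rank and world_size come from the DDP setup, or 0 and 1 on a single GPU):

train_loader = DataLoader('edu_fineweb10B', split='train', rank=rank, world_size=world_size)
x, y = train_loader.next_batch(micro_batch_size, block_size)
x, y = x.to(device), y.to(device)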

Evaluation

To track our progress, we'll evaluate on:

  1. Validation loss on held-out data
  2. HellaSwag benchmark for common sense reasoning
  3. Sample quality through generation

# Evaluate validation loss
def evaluate_validation():
    model.eval()
    with torch.no_grad():
        losses = []
        for _ in range(eval_iters):
            x, y = get_batch('val')
            logits, loss = model(x, y)
            losses.append(loss.item())
    model.train()
    return np.mean(losses)

# Evaluate HellaSwag: score each candidate ending by the average loss of its
# tokens given the shared context, and pick the ending with the lowest loss
def evaluate_hellaswag():
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for example in hellaswag_examples:  # dicts with 'ctx', 'endings', 'label'
            ctx = tokenizer.encode(example['ctx'])
            losses = []
            for ending in example['endings']:
                end = tokenizer.encode(" " + ending)
                ids = torch.tensor([ctx + end], device=device)
                logits, _ = model(ids[:, :-1])
                loss = F.cross_entropy(logits.transpose(1, 2), ids[:, 1:], reduction='none')
                losses.append(loss[0, -len(end):].mean().item())  # loss over the ending tokens only
            if int(torch.tensor(losses).argmin()) == int(example['label']):
                correct += 1
            total += 1
    model.train()
    return correct / total

Results

After training on 10 billion tokens of FineWeb-Edu data, our model surpasses the original GPT-2 124M on both validation loss and HellaSwag accuracy. Interestingly, we achieve this with roughly 10x fewer tokens than the original GPT-2 was trained on.

When training for 40 billion tokens (4 epochs), we approach the performance of the GPT-3 124M model, which was trained on 300 billion tokens.

Conclusion

We've successfully reproduced the GPT-2 124M model from scratch, optimized the training process, and even surpassed the original model's performance. This demonstrates how far AI training has come in just a few years - what used to require significant resources can now be done relatively quickly and affordably.

The ability to make your own AI model is no longer restricted to tech giants and research institutions. With the tools and resources available today, businesses and individuals can create custom solutions for all kinds of needs. Whether you're fine-tuning existing models or building your own GPT, the process becomes more accessible every day. By understanding the fundamentals and following the steps outlined in this guide, you can join the AI revolution and create tools that drive innovation in your field.

For detailed articles on alternative technology solutions, explore our Innovation Hangar blog or visit our offline events and tech conferences, where experts share the latest developments across all kinds of technologies.



Recent LiDAR scans of Białowieża Forest have revealed an unexpected network of linear structures that started to challenge our understanding...