How to Run an LLM?
Introduction
In this course, it is crucial that you learn how to use GPU resources on a computing cluster efficiently and work remotely. This workflow is not only essential for this course; it also mirrors the practices you’ll encounter in both research and industry settings. Whether you’re training large language models (LLMs) or working on other resource-intensive tasks, remote clusters equipped with GPUs are standard.
For Assignment 2, you will be working with larger language models like gpt2-xl, which are too large and computationally demanding to run on most local machines. Instead, you will use teach.cs (the department’s computing cluster) to handle these models.
This tutorial will walk you through running gpt2-xl on the teach.cs cluster, using GPUs via SLURM. You’ll also learn how to generate the top 5 next tokens for a given prompt using Hugging Face’s transformers library. For Quiz 6, you need to report these top 5 tokens.
Setting Things Up
The first step is to ensure that your work environment on the teach.cs server is properly set up. If you already did this for Assignment 1, you know how life-saving this setup can be. Having your environment configured correctly from the start will make working remotely on the cluster a lot smoother. Follow the instructions to configure your environment, including accessing the server, setting up SSH keys, and handling any initial configurations.
All required packages, including PyTorch and Hugging Face Transformers, are already installed on teach.cs, so you don’t need to install them manually. You are ready to start running your scripts and assignments directly using the provided computing resources.
How to Run GPT-2 XL?
Here’s a short tutorial on how to run gpt2-xl using the Hugging Face Transformers library and generate the top 5 next tokens for the prompt “The CN Tower is located in the city of”.
Step 1: Load the GPT-2 XL Model and Tokenizer
You can now load the gpt2-xl model and tokenizer from Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Load the GPT-2 XL model and tokenizer
# We have already downloaded models (gpt2, gpt2-medium, gpt2-large, gpt2-xl) in
# /u/csc485h/fall/pub/hf_models/
model_name = "gpt2-xl"
model = AutoModelForCausalLM.from_pretrained('/u/csc485h/fall/pub/hf_models/' + model_name).to('cuda')
tokenizer = AutoTokenizer.from_pretrained('/u/csc485h/fall/pub/hf_models/' + model_name)
# Set the model in evaluation mode
model.eval()
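If you ever want to test the same loading code on a machine without a GPU (for example, while debugging locally before submitting to the cluster), a minimal optional sketch is to pick the device dynamically instead of hard-coding 'cuda'. This is only a convenience, not something the assignment requires.
# Optional sketch (reuses the imports above): fall back to CPU when no GPU is available.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = AutoModelForCausalLM.from_pretrained('/u/csc485h/fall/pub/hf_models/' + model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained('/u/csc485h/fall/pub/hf_models/' + model_name)
model.eval()
# If you use this variant, also move the tokenized inputs with .to(device) in the next step.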
Step 2: Prepare the Prompt and Generate Top 5 Next Tokens
We will use the prompt “The CN Tower is located in the city of” and generate the top 5 possible next tokens.
# Define the input prompt
prompt = "The CN Tower is located in the city of"
# Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cuda')
# Get model predictions (logits)
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits
# Get the logits for the last token in the sequence
last_token_logits = logits[:, -1, :]
# Get the top 5 next token predictions
top_k = 5
top_k_results = torch.topk(last_token_logits, top_k)
# Convert the token IDs to words
top_k_ids = top_k_results.indices[0].tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)
# Or, use `tokenizer.decode` to get rid of all the special tokens.
# top_k_tokens = [tokenizer.decode(x) for x in top_k_ids]
print("Top 5 next tokens:", top_k_tokens)
Step 3: Run the Code on Server Using SLURM with GPU
Use the following srun command to submit the job to the csc485 partition and use GPU resources:
srun -p csc485 --gres gpu python3 run_model.py
This command submits a job to the csc485 partition of the SLURM workload manager and runs the Python script run_model.py. The system will allocate the necessary resources, including a GPU, for the task.
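If you want to confirm from inside the job that a GPU was actually allocated, one optional addition near the top of run_model.py is a quick check like this sketch:
# Optional sanity check for run_model.py: confirm a GPU is visible inside the SLURM job.
import torch
if torch.cuda.is_available():
    print("GPU allocated:", torch.cuda.get_device_name(0))
else:
    print("No GPU visible; check the --gres flag on your srun command.")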
Summary
- We load the gpt2-xl model and tokenizer using Hugging Face’s transformers library.
- We prepare the input prompt by tokenizing it.
- We pass the tokenized input to the model to get the logits (model outputs).
- From the logits of the last token, we select the top 5 next token predictions.
- Finally, we convert the token IDs to their corresponding words/tokens.
- Run your code on teach.cs with SLURM using the command srun.
Quiz 6
What are these top 5 most probable tokens? Select the top 5.
More Explanations
1. What is the purpose of AutoModelForCausalLM and AutoTokenizer?
- AutoModelForCausalLM:
  - This class is a specific model interface designed for causal language modeling (LM) tasks. Causal LM is typically used in autoregressive tasks where the model predicts the next token in a sequence based only on the preceding tokens. This is the standard use case for models like GPT-2, which generate text token by token.
  - It automatically loads the correct architecture for the task, depending on the pre-trained model you are using (such as GPT-2, BERT, LLaMA, etc.), without you needing to manually specify the exact architecture.
- AutoTokenizer:
  - This is the counterpart for tokenizing text. It automatically retrieves the appropriate tokenizer for the specific model you’re using. The tokenizer is responsible for converting human-readable text into token IDs, which the model can understand, and then converting the generated token IDs back into text.
  - Tokenizers often vary between models, so using AutoTokenizer ensures that you are using the right tokenizer for your specific model.
- Why not AutoModel?
  - AutoModel is a more generic class that loads the base model architecture but doesn’t know the specific task. It is useful for non-task-specific uses (e.g., embeddings, hidden states) but does not include heads for specific tasks like text generation.
  - Since GPT-2 is trained for text generation (a causal LM task), you need AutoModelForCausalLM, which includes the correct head for generating the next token in a sequence.
  - Summary: AutoModelForCausalLM includes the necessary layers for language modeling tasks like text generation, while AutoModel is a general-purpose model that doesn’t specialize in a specific task (see the short sketch below the docs).
Relevant Docs:
- Huggingface Auto Classes
- AutoModelForCausalLM Documentation
- AutoModel Documentation
- AutoTokenizer Documentation
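To make the difference concrete, here is a small illustrative sketch (using the smaller gpt2 checkpoint from the same pre-downloaded directory, purely to keep it light): the causal-LM class exposes logits over the vocabulary, while the base class only exposes hidden states.
# Illustrative sketch: compare AutoModel vs. AutoModelForCausalLM outputs on the same input.
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
import torch
path = "/u/csc485h/fall/pub/hf_models/gpt2"  # smaller checkpoint, just for illustration
tokenizer = AutoTokenizer.from_pretrained(path)
input_ids = tokenizer.encode("The CN Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    base_out = AutoModel.from_pretrained(path)(input_ids)
    lm_out = AutoModelForCausalLM.from_pretrained(path)(input_ids)
print(base_out.last_hidden_state.shape)  # (batch, sequence length, hidden size)
print(lm_out.logits.shape)               # (batch, sequence length, vocabulary size)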
2. Why use torch.no_grad()?
- Purpose:
  - torch.no_grad() is a context manager in PyTorch that disables gradient calculations.
  - This is useful when you are performing inference (i.e., when you are not training the model but merely using it to make predictions). When gradients aren’t required, it saves memory and improves computational efficiency by not storing the gradient history.
- Reason for using it in this context:
  - During inference (text generation in this case), you are not backpropagating through the model to update its weights. You only need the outputs of the model (logits), so computing and storing gradients is unnecessary.
  - Disabling gradients makes the inference faster and reduces the memory overhead, which is especially important when using large models like gpt2-xl (see the short sketch below).
Relevant Docs:
- torch.no_grad Documentation
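As a small, PyTorch-only illustration of what the context manager changes (a sketch, independent of the GPT-2 code above): tensors computed under torch.no_grad() carry no gradient history, which is why inference uses less memory.
# Sketch: results produced under torch.no_grad() do not track gradients.
import torch
x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()
print(y.requires_grad)  # True: this result is part of the autograd graph
with torch.no_grad():
    z = (x * 2).sum()
print(z.requires_grad)  # False: no gradient history is recorded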
3. SLURM
SLURM (Simple Linux Utility for Resource Management) is a job scheduler used in computing clusters to allocate resources (like CPUs and GPUs) for running tasks. Instead of running heavy computational jobs directly on a machine, you submit them to SLURM, which handles scheduling and resource management. This allows many users to efficiently share the cluster’s resources.
Basic SLURM Commands
- srun: Used to submit and run jobs interactively or non-interactively on the cluster. It allows you to specify resources like partition, CPUs, GPUs, and memory.
  srun -p csc485 python3 run_model.py
  Here, -p csc485 specifies the partition (queue) you’re submitting to, and run_model.py is your Python script.
- squeue: Displays the list of jobs in the queue, showing job IDs, statuses, user names, and partitions.
  squeue
  This command shows the jobs currently running or waiting on the cluster.
  squeue -u $USER
  This command shows all of your jobs.
- scancel: Used to cancel a job. You can cancel a job using its job ID.
  scancel <job_id>
  This cancels the job with the specified ID.
- sinfo: Displays information about the available resources on the cluster, such as partitions, nodes, and their current status (e.g., available, allocated, or down).
  sinfo
  This shows details about the cluster, including node availability and partitions.
4. Hugging Face Model Caching
When you use Hugging Face models, the models are automatically downloaded and stored in a local cache directory. By default, this cache is located at ~/.cache/huggingface/transformers. This allows you to avoid redownloading the models every time you use them, speeding up model loading.
To make things easier for you, the models you’ll need have already been downloaded and stored in /u/csc485h/fall/pub/hf_models/. To use these models directly, you should specify the full path to the model files instead of using the default Hugging Face loading mechanism.
from transformers import AutoModelForCausalLM, AutoTokenizer
# Path to the pre-downloaded GPT-2 XL model
model_path = "/u/csc485h/fall/pub/hf_models/gpt2-xl/"
# Load the GPT-2 XL model and tokenizer directly from the specified path
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
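As a quick, optional sanity check that the locally stored model loads correctly, here is a hedged usage sketch that greedily decodes a short continuation with the model and tokenizer loaded above (not required for the quiz, which asks only for the top 5 next tokens):
# Optional usage sketch: greedy-decode a short continuation (assumes the model/tokenizer above).
inputs = tokenizer("The CN Tower is located in the city of", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))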