
How to Run an LLM?

Introduction

In this course, it is crucial that you learn how to use GPU resources on a computing cluster efficiently and to work remotely. This workflow is not only essential for this course; it also mirrors the practices you'll encounter in research and industry. Whether you're training large language models (LLMs) or working on other resource-intensive tasks, remote clusters equipped with GPUs are standard.

For Assignment 2, you will be working with larger language models like gpt2-xl, which are too large and computationally demanding to run on most local machines. Instead, you will use teach.cs (the department’s computing cluster) to handle these models.

This tutorial will walk you through running gpt2-xl on the teach.cs cluster, using GPUs via SLURM. You'll also learn how to generate the top 5 next tokens for a given prompt using Hugging Face's transformers library. For Quiz 6, you will need to report these top 5 tokens.


Setting Things Up

The first step is to ensure that your work environment on the teach.cs server is properly set up. If you already did this for Assignment 1, you know how life-saving this setup can be. Having your environment configured correctly from the start will make working remotely on the cluster a lot smoother. Follow the instructions to configure your environment, including accessing the server, setting up SSH keys, and handling any initial configurations.
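If you haven't set up SSH keys yet, adding a host alias makes connecting far less tedious. Below is a minimal sketch of an ~/.ssh/config entry; the hostname, username, and key file are placeholders, so substitute the values from the course's setup instructions.

# Sketch of an ~/.ssh/config entry; hostname, username, and key path are placeholders
Host teach
    HostName teach.cs.toronto.edu
    User your_utorid
    IdentityFile ~/.ssh/id_ed25519

With this entry in place, `ssh teach` is all you need to log in.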

All required packages, including PyTorch and Hugging Face Transformers, are already installed on teach.cs, so you don’t need to install them manually. You are ready to start running your scripts and assignments directly using the provided computing resources.
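If you want to confirm that the preinstalled packages are visible to the Python you're running, a quick one-liner like the following works (the exact version numbers will vary):

python3 -c "import torch, transformers; print(torch.__version__, transformers.__version__)"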

How to Run GPT-2 XL?

Here’s a short tutorial on how to run gpt2-xl using the Hugging Face Transformers library and generate the top 5 next tokens for the prompt “The CN Tower is located in the city of”.

Step 1: Load the GPT-2 XL Model and Tokenizer

You can now load the gpt2-xl model and tokenizer from Hugging Face.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the GPT-2 XL model and tokenizer
# We have already downloaded models (gpt2, gpt2-medium, gpt2-large, gpt2-xl) in 
# /u/csc485h/fall/pub/hf_models/
model_name = "gpt2-xl"
model = AutoModelForCausalLM.from_pretrained('/u/csc485h/fall/pub/hf_models/' + model_name).to('cuda')
tokenizer = AutoTokenizer.from_pretrained('/u/csc485h/fall/pub/hf_models/' + model_name)

# Set the model in evaluation mode
model.eval()
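Note that `.to('cuda')` assumes a GPU has actually been allocated to you; on a login node or in a CPU-only session it will raise an error. A minimal variation that falls back to the CPU (only useful for quick debugging, since gpt2-xl is very slow without a GPU) might look like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fall back to CPU when no GPU has been allocated (e.g. on a login node);
# gpt2-xl is very slow on CPU, so this is only for quick sanity checks.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_path = "/u/csc485h/fall/pub/hf_models/gpt2-xl"
model = AutoModelForCausalLM.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()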

Step 2: Prepare the Prompt and Generate Top 5 Next Tokens

We will use the prompt “The CN Tower is located in the city of” and generate the top 5 possible next tokens.

# Define the input prompt
prompt = "The CN Tower is located in the city of"

# Tokenize the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt").to('cuda')

# Get model predictions (logits)
with torch.no_grad():
    outputs = model(input_ids)
    logits = outputs.logits

# Get the logits for the last token in the sequence
last_token_logits = logits[:, -1, :]

# Get the top 5 next-token predictions (torch.topk returns both values and indices)
top_k = 5
top_k_values, top_k_indices = torch.topk(last_token_logits, top_k)

# Convert the token IDs to tokens
top_k_ids = top_k_indices[0].tolist()
top_k_tokens = tokenizer.convert_ids_to_tokens(top_k_ids)
# Or, use `tokenizer.decode` to get plain text without the byte-level BPE markers (e.g. the leading 'Ġ').
# top_k_tokens = [tokenizer.decode(x) for x in top_k_ids]

print("Top 5 next tokens:", top_k_tokens)

Step 3: Run the Code on Server Using SLURM with GPU

Use the following srun command to submit the job to the csc485 partition and use GPU resources:

srun -p csc485 --gres gpu python3 run_model.py

This command submits a job to the csc485 partition of the SLURM workload manager and runs the Python script run_model.py. The system will allocate the necessary resources, including a GPU, for the task.
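For longer or repeated runs, you may prefer submitting a batch script with sbatch instead of an interactive srun. Here is a minimal sketch: the -p and --gres options mirror the srun command above, while the time limit and output file name are placeholders you would adjust.

#!/bin/bash
# Placeholder batch script: -p and --gres mirror the srun command above;
# the time limit and output file name are example values.
#SBATCH -p csc485
#SBATCH --gres=gpu
#SBATCH --time=00:10:00
#SBATCH --output=run_model.out

python3 run_model.py

Save this as run_model.sh, submit it with sbatch run_model.sh, and check its status with squeue -u $USER.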

Summary

Quiz 6

What are these top 5 most probable tokens? Select the top 5.

More Explanations

1. What is the purpose of AutoModelForCausalLM and AutoTokenizer?

AutoModelForCausalLM and AutoTokenizer are convenience classes: they read the model's configuration and instantiate the matching architecture-specific model and tokenizer for you (for gpt2-xl, GPT2LMHeadModel and the GPT-2 tokenizer). This lets the same loading code work unchanged across different causal language models.

Relevant Docs:
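As a quick, optional sanity check (using the model path from the tutorial above), you can inspect the model's configuration to see which concrete architecture the Auto classes will build:

from transformers import AutoConfig

# The config records which architecture the Auto classes should instantiate
config = AutoConfig.from_pretrained("/u/csc485h/fall/pub/hf_models/gpt2-xl")
print(config.model_type)     # "gpt2"
print(config.architectures)  # ["GPT2LMHeadModel"]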

2. Why use torch.no_grad()?

torch.no_grad() disables gradient tracking, so PyTorch does not build a computation graph during the forward pass. Since we are only running inference (no backpropagation), this saves memory and speeds up the forward pass.

Relevant Docs:
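A tiny self-contained example of the difference (independent of the tutorial code): tensors produced inside torch.no_grad() do not track gradients, which is exactly what we want for inference.

import torch

x = torch.randn(3, requires_grad=True)

# Normal forward pass: the result records how it was computed
y = (x * 2).sum()
print(y.requires_grad)  # True

# Inside no_grad, no computation graph is built, saving memory at inference time
with torch.no_grad():
    z = (x * 2).sum()
print(z.requires_grad)  # False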

3. SLURM

SLURM (Simple Linux Utility for Resource Management) is a job scheduler used in computing clusters to allocate resources (like CPUs and GPUs) for running tasks. Instead of running heavy computational jobs directly on a machine, you submit them to SLURM, which handles scheduling and resource management. This allows many users to efficiently share the cluster’s resources.

Basic SLURM Commands

  1. srun: Used to submit and run jobs interactively or non-interactively on the cluster. It allows you to specify resources like partition, CPUs, GPUs, and memory.

    srun -p csc485 python3 run_model.py
    

    Here, -p csc485 specifies the partition (queue) you’re submitting to, and run_model.py is your Python script.

  2. squeue: Displays the list of jobs in the queue, showing job IDs, statuses, user names, and partitions.

    squeue
    

    This command shows the jobs currently running or waiting on the cluster.

    squeue -u $USER
    

    This command shows all of your jobs.

  3. scancel: Used to cancel a job. You can cancel a job using its job ID.

    scancel <job_id>
    

    This cancels the job with the specified ID.

  4. sinfo: Displays information about the available resources on the cluster, such as partitions, nodes, and their current status (e.g., available, allocated, or down).

    sinfo
    

    This shows details about the cluster, including node availability and partitions.

4. Hugging Face Model Caching

When you use Hugging Face models, the models are automatically downloaded and stored in a local cache directory. By default, this cache is located at ~/.cache/huggingface/transformers. This allows you to avoid redownloading the models every time you use them, speeding up model loading.

To make things easier for you, the models you’ll need have already been downloaded and stored in /u/csc485h/fall/pub/hf_models/. To use these models directly, you should specify the full path to the model files instead of using the default Hugging Face loading mechanism.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Path to the pre-downloaded GPT-2 XL model
model_path = "/u/csc485h/fall/pub/hf_models/gpt2-xl/"

# Load the GPT-2 XL model and tokenizer directly from the specified path
model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
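If you ever do need to download a model yourself (not required for this assignment), keep in mind that your home directory may have a limited quota. One option is to point the Hugging Face cache elsewhere via the HF_HOME environment variable; the path below is only a placeholder.

# Placeholder path: choose a location with enough free space on your account
export HF_HOME=/path/with/more/space/hf_cache
python3 run_model.py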