Knowledge Base

RACC2 – GPU computing

What is GPU computing?

GPU computing uses a graphics processing unit (GPU) as a co-processor to accelerate a central processing unit (CPU) for general scientific computing. While GPUs were originally designed for graphics workloads, they are now widely used to speed up compute-intensive applications.

Many parallelised applications run significantly faster by offloading their most computationally demanding sections to the GPU, while the remainder of the code continues to run on the CPU. A typical CPU has a relatively small number of powerful cores, whereas a GPU contains hundreds or thousands of smaller cores, enabling much higher throughput for suitable workloads.

Many scientific applications support GPU acceleration and can be developed or enhanced using frameworks such as NVIDIA’s CUDA toolkit, which provides GPU-optimised libraries as well as debugging and performance-tuning tools.
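If you want to check whether the CUDA toolkit is visible in your current environment, one quick test is to look for the nvcc compiler. This is only a sketch; how CUDA is provided on RACC2 (for example inside containers, as in the Apptainer example later on this page) may differ:

# check whether the CUDA compiler driver is on the PATH and, if so, print its version
if command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found in this environment"
fi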

 

GPU computing on the RACC2

On RACC2, users can request one or more GPUs and combine them with a suitable number of CPU cores. This is illustrated in the example SLURM submission script below, which is also available in /software/slurm_examples/gpu/.

#!/bin/bash

# standard CPU directives; tip: use --cpus-per-task to allocate one or more cores per GPU
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1  
#SBATCH --threads-per-core=1
# plus the GPU line, you can request one or more GPUs on the same node
#SBATCH --gres=gpu:1
 
# partition 'gpuscavenger' or project partition
# jobs in gpuscavenger use idle time on GPUs and might get killed and re-queued
#SBATCH --partition=gpuscavenger
 
#SBATCH --job-name=example_gpu_job
#SBATCH --output=gpu_out.txt
 
# 24 hours is the default time limit in the partition 'gpu_limited'
#SBATCH --time=24:00:00
#SBATCH --mem=48G 

# optional commands for debugging
hostname
nvidia-smi
echo CUDA_VISIBLE_DEVICES $CUDA_VISIBLE_DEVICES

#and the actual job
./gpu_hello_apptainer.sh

The above is an example of a job where we expect almost all of the work to be done on the GPU, so we request just one CPU core (--cpus-per-task=1). This pattern does not suit every job: some applications do substantial work on both the CPUs and the GPUs, in which case it can be beneficial to allocate more CPU cores. GPUs are requested with the directive #SBATCH --gres=gpu:N, where N is the number of GPUs your job will use; in the example above we allocate just one GPU.
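As an illustration of a more CPU-heavy setup, the directives below request two GPUs with four CPU cores per GPU. The numbers are examples only; choose values that match your application and the node specifications listed at the bottom of this page:

# illustrative resource request: 2 GPUs and 8 CPU cores (4 cores per GPU)
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:2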

 

There are two options for accessing GPUs:

  1. The project partitions: These partitions are available only to the research groups that purchased the GPU nodes. Users should submit jobs to the partition associated with their project. While there is no enforced time limit, specifying a realistic walltime is strongly recommended, as this helps the scheduler operate efficiently and is considerate of other users sharing the nodes.
  2. The gpuscavenger partition: This partition allows users to take advantage of idle GPU capacity on hardware owned by specific projects. Jobs submitted to gpuscavenger may be terminated and automatically re-queued if the nodes are required for higher-priority project jobs (see the sinfo sketch below).
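Both kinds of partition can be inspected with standard Slurm commands. The sketch below uses gpuscavenger from the example above; 'myproject' is a placeholder for your own project partition name:

# show the nodes and their state in the gpuscavenger partition
sinfo -p gpuscavenger
# the same for a project partition ('myproject' is a placeholder)
sinfo -p myproject
# list your own pending and running jobs
squeue -u $USER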

In the example above, we have included several commands that can help diagnose potential issues with GPU access. Printing the hostname allows you to identify the node on which your job is running. In addition, the ‘nvidia-smi’ command displays information about the installed NVIDIA driver and the available GPUs. Successful output from this command confirms that GPUs are present on the system and that the NVIDIA drivers are correctly installed.
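You can also run these checks interactively before submitting a batch job by requesting a GPU with srun. The example below assumes access to the gpuscavenger partition; see RACC2 – Login and Interactive Computing for the recommended way to start interactive sessions on RACC2:

# run nvidia-smi on a GPU node with one GPU allocated
srun --partition=gpuscavenger --gres=gpu:1 nvidia-smi
# or start an interactive shell on a GPU node
srun --partition=gpuscavenger --gres=gpu:1 --pty bash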

The example job script gpu_hello_apptainer.sh demonstrates how to use Apptainer with PyTorch to run an application on a GPU.
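As a rough sketch of the idea (not the actual contents of gpu_hello_apptainer.sh), an Apptainer command that exposes the GPU to a PyTorch container might look like this; the container image is an example only:

# the --nv flag makes the NVIDIA driver and GPUs visible inside the container
apptainer exec --nv docker://pytorch/pytorch \
    python3 -c "import torch; print('CUDA available:', torch.cuda.is_available())"

If the job ran on a GPU node and the container can see the device, this prints 'CUDA available: True'.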

 

The following GPU nodes are currently available:

node          GPUs              GPU memory (per device)   system memory
racc2-gpu-0   3 x Tesla H100    96 GB                     384 GB
racc2-gpu-1   2 x Tesla H100    96 GB                     768 GB
racc2-gpu-1   4 x Tesla L40S    48 GB                     768 GB
racc2-gpu-1   4 x Tesla L40S    48 GB                     768 GB

 


Related articles

RACC2 – Introduction

RACC2 – Login and Interactive Computing

RACC2 – Batch Jobs

RACC2 – Slurm commands