Knowledge Base

RACC2 – Batch Jobs

1. Introduction to batch processing

The strength of the Reading Academic Computing Cluster lies in its ability to run batch jobs. Batch jobs should be used whenever possible; interactive jobs on the login nodes are only justified for work that truly needs user interaction, such as data visualization or code development and debugging.

 

Overview

Executing a program in batch mode means that a command or a sequence of commands is listed in a text file, which is essentially a set of instructions telling the cluster how to run your job. This text file is called a batch script or a job script. The batch script is submitted to a batch scheduler for execution, rather than being executed directly in the shell. It is executed on the compute nodes when the requested compute resources become available, without any further user intervention. The opposite of a batch job is interactive computing, where you enter individual commands to run immediately in the shell.

Typically, compute clusters consist of one or more login nodes, which are used to prepare, submit and manage batch jobs, and a large number of compute nodes. You can find more details about the RACC2 in our introduction. Batch computing involves users submitting jobs to a piece of software called a batch scheduler, or resource manager, that decides the best and most efficient way to run the jobs while maintaining the highest possible usage of all resources. The scheduler keeps track of all computational resources available on the compute nodes, e.g. CPU cores and memory, and of all the jobs submitted by the users. The jobs are then dispatched to suitable compute nodes as sufficient resources become available. Most clusters consist of multiple partitions, or job queues, which group different types of compute nodes and/or are set up with different access and usage policies.

The cluster resources are managed using the SLURM workload manager; see https://slurm.schedmd.com/ for more information.
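As an illustration, the partitions available on the cluster, their time limits and the state of their nodes can be listed on a login node with the standard Slurm command sinfo (the exact output depends on the cluster configuration):

$ sinfo                # list all partitions, their time limits and node states
$ sinfo -p long        # show only the 'long' partition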

 

2. Serial batch jobs

A serial batch job is a job that runs on only one CPU.

 

Our first batch job

A job script consists of two parts: the resource request and one or more commands to run. In the batch script, the resource request instructions start with ‘#SBATCH’ and contain e.g. the specification of required CPU cores, running time, memory etc. Note that the Slurm directives starting with ‘#SBATCH’ are treated as comments by the shell. Consequently, any shell expressions and variables in those lines are not expanded or executed.
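For example (a minimal illustration of this point; the file names are hypothetical), a shell variable in an ‘#SBATCH’ line is not expanded, while Slurm’s own filename patterns are:

# this does NOT work as intended: $USER is not expanded,
# so the output file would literally be named out_$USER.txt
#SBATCH --output=out_$USER.txt

# Slurm's own filename patterns do work, e.g. %j is replaced with the job ID
#SBATCH --output=out_%j.txt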

The RACC2 has sensible default values for the resource request options, which are suitable for submitting a short serial job. Relying on those default values, a job script can be as simple as:

#!/bin/bash

./myExecutable.exe

Note that the above is a regular shell script, which could be run interactively (and often it might be a good idea to run the script interactively at first, for testing). Let’s name our job script file script.sh, which we can submit to the scheduler with the command:

$ sbatch script.sh
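As mentioned above, the same script can first be run interactively on a login node for a quick test before it is submitted (keeping such test runs short), for example:

$ bash script.sh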

Once the job is submitted, we can check its status in the queue with the command squeue -u <user name> (it is convenient to specify your user name to see only your jobs as squeue alone will show you everyone’s jobs):

$ squeue -u qx901702
JOBID NAME USER PARTITION STATE NODELIST(REASON) TIME TIME_LIMIT CPUS MIN_MEMO
1229352 script.s qx901702 short PENDING  (Resources)                 0:00 1-00:00:00 1 4G

At first the job might be in ‘PENDING’ state, which means that it is waiting in the queue to be executed. Later on, once the job is running, we will see the following output:

$ squeue -u qx901702
JOBID NAME USER PARTITION STATE NODELIST(REASON) TIME TIME_LIMIT CPUS MIN_MEMO
1229352 script.s qx901702 short RUNNING compute-5-0 0:52 1-00:00:00 2 4G
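More details about a job, including the applied default values for the time limit and memory, can be displayed with the standard Slurm command scontrol while the job is pending or running:

$ scontrol show job 1229352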

If we realize that a running job is no longer needed, we can terminate it with the command scancel followed by the JobID, which is the number that identifies the job and is shown in the output of the squeue command (see above):

$ scancel 1229352
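scancel also accepts other selectors; for example, all of your own jobs can be cancelled at once by specifying your user name:

$ scancel -u qx901702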

Note that by default a job receives the name of the job script. If the job writes to standard output or to standard error, the default behavior is that both are redirected to a file named slurm-<job ID>.out, placed in the directory from which the job was submitted. In our example above it would be slurm-1229352.out. Furthermore, we can see above that the default memory allocation is 4G (i.e. 4 GB) and the default time limit is 1-00:00:00 (i.e. 1 day). The job runs in the default partition ‘short’, where the default time limit of 24h is also the maximum allowed time limit. If we need more than that, we can customize our resource request as shown in the next section.

 

A custom batch job

In most cases, we want to customize our batch script to ask for the resources that are needed for the job. Consider the following example script:

#!/bin/bash

# these are all the default values anyway
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1

# the job name (default is the script name) and the output file
# (default is slurm-<job ID>.out in the submission directory);
# note that comments should not be placed on the same line as an #SBATCH directive
#SBATCH --job-name=example_batch_job
#SBATCH --output=myout.txt

# time limit of 2 days and 0 hours (default is 24 hours)
# and memory allocation of 6 GB (default is 4G)
#SBATCH --time=2-0:00:00
#SBATCH --mem=6G

# 'short' is the default partition, with max time of 24h
# longer jobs are submitted to the partition 'long'
#SBATCH --partition=long

./myExecutable.exe

The above is a single process (--ntasks=1, --cpus-per-task=1) batch job. The significance of ntasks, cpus-per-task and threads-per-core will be discussed in more detail in the sections focusing on parallel batch jobs. Here we will explain them only briefly, and you can safely skip the next paragraph if you find it too technical.

‘ntasks’ defines the number of processes that Slurm will automatically start for us. We will set it to a value greater than 1 only in the case of parallel MPI (distributed memory) jobs. ‘cpus-per-task’ defines the number of CPU cores allocated for the job. We will use it to allow access to multiple CPU cores in parallel jobs that are not ‘standard’ MPI jobs (which could be started with mpirun). This includes OpenMP (shared memory) multi-threaded jobs, or any other cases where the job spawns processes or threads on its own, e.g. when using parallel library functions or parallelization constructs in applications or languages like Matlab, Python and R.

Finally, we will always set ‘--threads-per-core=1’. In most cases this is the default, but to avoid confusion it is good to always add it to the job script. The reason for adding this line is related to simultaneous multithreading (SMT), also called hyper-threading in the case of Intel processors. Hyper-threading provides two hardware threads (logical CPUs; do not confuse these with the execution threads of multithreaded programs) per physical CPU core. For CPU-intensive numerical modelling jobs, it is typically more efficient not to run more processes or threads than the number of physical cores. In Slurm, a logical CPU is defined as a hardware thread. This might be confusing, because we ask for one CPU with the intention of using one CPU core, and once the job starts it is counted as using two CPUs in Slurm, because it consumes the two hardware threads associated with one CPU core. Coming back to the ‘cpus-per-task’ allocation request, with ‘--threads-per-core=1’ it is guaranteed that the job will have exclusive access to N CPU cores, and to 2N hardware threads. It is also enforced that the job will not exceed this allocation: regardless of how many processes or threads are created, and regardless of how many CPU cores the node has, those processes and threads will share only the allowed number of cores and hardware threads. (Experienced users might want to decide whether to allocate a physical core or a hardware thread for each of their processes or threads, but for numerical modelling jobs one physical core per process or thread is usually more efficient. The other hardware thread can still be used for some non-blocking library calls; e.g. we have found that in MPI jobs there is a substantial performance gain when the second hardware thread is available for non-blocking MPI calls.)

It is useful to specify a custom name for the job and for the output file, to which both standard output and standard error are redirected. Here they are ‘example_batch_job’ and ‘myout.txt’, respectively.

Now we move on to the trickier part, the job’s running time and memory allocation request. You might not know in advance how long the job will run and how much memory it will need. You will have to estimate this by trial and error, and you should aim to be reasonably accurate in the end. If you overestimate the requirements, resources will be wasted and the job’s queuing time might increase, especially in the case of over-allocating memory. On the other hand, if you underestimate how much time or memory your job needs, the job will be killed when it exceeds the time limit or the memory allocation. In this example we set the time limit to 2 days and the memory allocation to 6 GB.
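To refine those estimates, the accounting records of completed jobs can be queried with the standard Slurm command sacct, e.g. to compare the requested time and memory with what the job actually used (the fields available depend on the cluster’s accounting configuration):

$ sacct -j 1229352 --format=JobID,JobName,Elapsed,Timelimit,MaxRSS,ReqMem,State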

The default partition (job queue) is ‘short’. If you need a longer running time, you can submit your job to the partition ‘long’, or to one of the project partitions available to paid projects. The default account is ‘shared’; you do not need to change this. In the past an account needed to be specified for access to a paid-for project allocation, but now specifying the project partition is sufficient, and access is controlled using security groups. The following section provides more information about partitions, resource limits and resource allocation policies.

Email notifications are disabled by default; they can be enabled with the following directives:

#SBATCH --mail-type=ALL
#SBATCH --mail-user=<your email address>

You might find it confusing that ‘squeue’ will often show that your serial job uses not 1 but 2 CPUs. This is because of hyper-threading; recall the option ‘--threads-per-core=1’. Modern processors have a certain number of physical cores and might present a larger number of virtual CPUs; typically one physical core is shared by two virtual CPUs (hardware threads). Slurm counts logical CPUs, not physical cores. However, in most numerical modelling jobs it does not make much sense for two computing threads to share the same physical core. Therefore a whole physical core, which has two virtual CPUs, is allocated to the job.

 

Resource allocation policies

A batch job is submitted to one or more partitions (job queues). Different partitions are associated with different access policies, resource allocation policies and resource limits, and might provide access to different nodes. Selecting the right partition allows efficient use of resources and shortens the job’s queuing and running time. These are the partitions currently in use:

  • short – the default partition, the time limit is 24 hours, this partition is suitable for small and medium size parallel jobs and small to medium numbers of serial jobs.
  • long – this partition is intended for small numbers of serial and medium sized parallel jobs that require a longer running time. The maximum time limit is 30 days. This partition allows fewer jobs to be submitted and run, and fewer CPUs and less memory to be allocated, than the partition ‘short’.

For experienced users and for large resource requirements:

  • scavenger – this partition provides free access to all nodes, including specialised nodes and the new nodes purchased by projects. The intention is to allow all idle time on the project-owned nodes, and on all the other nodes, to be used for free. This partition includes the new, most efficient nodes, so it is recommended to use it whenever it is practical. There is no limit on how much of the available resources you can use simultaneously. However, if the resources are requested for higher priority or more suitable jobs, jobs in the scavenger partition might be killed and re-queued. The maximum time limit is 3 days, but the default time limit is 24 hours, and we recommend not increasing it, to avoid wasting too much already consumed CPU time if jobs are re-queued. We no longer suspend jobs as a way to free up resources for higher priority jobs.

For paying projects:

  • each project has its own partition, and access is controlled using a security group named in the form racc_<project (partition) name>. The standard project partitions share the same uniform pool of nodes funded by the projects, and the resource limits are proportional to each project’s contribution. There might also be some project partitions with access to their own specialised nodes.
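Note that resource request options, including the partition, can also be given on the sbatch command line, where they override the corresponding values in the job script. For example (using the partition names described above; the project partition name is a placeholder):

$ sbatch --partition=long script.sh
$ sbatch --partition=<project partition name> script.sh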

Job Arrays

Job arrays make it easy to manage large collections of similar jobs, e.g. to run the same program on a collection of input files or on a range of input parameters, or to run a number of independent stochastic simulations for statistical averaging. We recommend using job arrays whenever possible, because they reduce the load on the scheduler.

Submitting a job array is as simple as adding ‘--array=<index range>’ to the job submission parameters. We can specify the task IDs as a range of consecutive integers: ‘--array=1-10’, as a list: ‘--array=1,3,5’, or with an increment: ‘--array=1-10:2’.
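For example, an existing job script can be submitted as a job array directly from the command line, without editing the script:

$ sbatch --array=1-10 script.sh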

In the following example we submit a job consisting of 1000 array tasks:

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --array=1-1000
#SBATCH --job-name=example_job_array
#SBATCH --output=arrayJob_%A_%a.out
# time limit of 120 minutes and 512 MB of memory
#SBATCH --time=120:00
#SBATCH --mem=512
# jobs might be killed and re-queued; only this partition allows
# submitting and running so many array tasks
#SBATCH --partition=scavenger

echo "This array task index is $SLURM_ARRAY_TASK_ID"

./myExecutable.exe input_file_no_${SLURM_ARRAY_TASK_ID}.in

The job tasks are distinguished by their ID provided by the environment variable SLURM_ARRAY_TASK_ID. ‘%A’ and ‘%a’ evaluate to the job ID and task ID and can be used to construct output file names. The above job takes a number of input files, one for each individual array task.
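Array tasks do not have to map to separate input files. For example, each task can read one line from a plain text file containing one parameter set per line; a minimal sketch (the file name params.txt is hypothetical):

# pick line number $SLURM_ARRAY_TASK_ID from the parameter file
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" params.txt)
./myExecutable.exe $PARAMS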

In the job script above, we submitted the array job to the partition ‘scavenger’. We need to be prepared that some array tasks might be killed and re-queued. The advantage is that we can submit large array jobs and have access to many CPU cores.

 

3. Parallel batch jobs

A parallel batch job is a job that runs on multiple CPUs.

 

A very brief introduction to OpenMP and MPI

OpenMP (Open Multiprocessing) is a programming platform that allows one to parallelise code over a homogeneous shared memory system (e.g., a multi-core processor). For instance, one could parallelise a set of operations over a multi-core processor where the cores share memory between each other.

MPI (Message Passing Interface) is a programming platform with the ability to parallelise code over a (non-)homogeneous distributed system (e.g., a supercomputer). For instance, it’s possible to parallelise an entire program over a network of computers, or nodes, which communicate over the same network but have their own memory layout.

Shared memory allows multiple processing elements to share the same location in memory (that is, to see each other’s reads and writes) without any other special directives, while distributed memory requires explicit commands to transfer data from one processing element to another.

The schematic below illustrates the principles of distributed memory and shared memory parallel computing.

 

 

Shared memory jobs (e.g. OpenMP) and parallel libraries (e.g. in Matlab, Python, etc.)

This is an example of a 16-way OpenMP job.

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --threads-per-core=1
#SBATCH --job-name=test_smp
#SBATCH --output=myout.txt
#SBATCH --time=120:00
#SBATCH --mem-per-cpu=512

export OMP_NUM_THREADS=16
./a.out

In the above script, one task is requested, with 16 CPU cores allocated to this task. The executable can use up to 16 CPU cores for its threads or processes. In a similar fashion, parallel Matlab jobs can be launched (with the Matlab parallel toolbox, but without the Matlab parallel server; not tested), as can any other applications which use multiple CPUs and manage them on their own.

Using ‘--cpus-per-task’ is a bit tricky because of hyper-threading. For Slurm, a CPU is a logical CPU (hardware thread). On the RACC2, Slurm is configured to always allocate a whole physical core to a task. But in the directive ‘--cpus-per-task’ we are counting Slurm’s CPUs, which, on processors with hyper-threading, are logical CPUs (hardware threads). The nodes on the RACC2 have hyper-threading enabled, which means there are two logical CPUs per physical core. If you are happy to count CPUs as hardware threads, this is easy and consistent in both cases. However, it is often better to run just one thread per physical core, and then some customization of the job, depending on the compute node capability, is needed, as in the sketch below.
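A minimal sketch of such a customization, under the assumption of two hardware threads per physical core as described above (SLURM_CPUS_PER_TASK is the standard environment variable that Slurm sets when --cpus-per-task is used):

# count CPUs as hardware threads: one OpenMP thread per allocated logical CPU
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

# alternatively, run one thread per physical core
# (assuming two hardware threads per core, as on the RACC2 nodes)
# export OMP_NUM_THREADS=$(( ${SLURM_CPUS_PER_TASK:-2} / 2 ))

./a.out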

 

MPI jobs

This is an example of a 16-way MPI job.

#!/bin/bash

#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --threads-per-core=1
#SBATCH --nodes=1-1
#SBATCH --job-name=test_mpi
#SBATCH --output=myout.txt
#SBATCH --time=120:00
#SBATCH --mem-per-cpu=512

module load mpi/mpich-x86_64
srun ./myMPIexecutable.exe

The above script requests 16 tasks for 16 MPI processes. Typically it is better to have all the processes running on the same node, which is requested with ‘--nodes=1-1’. In Slurm, the number of tasks is the number of instances of the command that are run, typically by a single srun command. In the above script we have 16 identical processes, hence this is a job with 16 tasks, and not just one task using 16 CPU cores.

The srun command will take care of creating and managing the MPI processes; it replaces the mpirun or mpiexec commands. It is important to add ‘#SBATCH --threads-per-core=1’ when srun manages the MPI processes, in order to get correct CPU affinity binding. With this option, similarly to the case of serial jobs, an MPI process (task) gets access to a whole physical core, when possible, and is then counted as two CPUs in Slurm. The extra hardware thread on each core is used for MPI I/O calls and contributes to better performance.
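If in doubt, the CPU binding applied to the MPI tasks can be reported by srun itself, using the standard --cpu-bind option (the report is written to the job’s output/error file):

srun --cpu-bind=verbose ./myMPIexecutable.exe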

 


Related articles

RACC2 – Introduction

RACC2 – Login and Interactive Computing

RACC2 – Slurm commands