
Reading Academic Computing Cluster – parallel batch jobs

(work in progress)

Parallel batch jobs

Parallel jobs in SLURM

In the SGE resource manager on the met-cluster and the maths-cluster, a job is simply requested with a number of CPU slots allocated to it. On those clusters it is up to the user to spawn the required processes, and it is the user’s responsibility not to oversubscribe their allocation. SLURM offers more help and flexibility in starting parallel jobs. It will also forcibly limit the resources available to the job to those specified in the job allocation.

In SLURM, a job can consist of multiple job slices. A job slice is a command or a script. In the simplest case, the execution of the commands in the job script itself is the only job slice; an example is the serial batch job script above. Other job slices within a batch script can be started using the srun command. Job slices can run either in parallel or sequentially within the job allocation.

A task can be interpreted as an instance of a job slice. SLURM can start a number of identical tasks in parallel for each job slice, as specified with the ‘--ntasks’ flag for that job slice, i.e. for the job script as a whole or for the individual srun call.
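As a minimal sketch (the job name, output file and time limit below are only illustrative), the following script defines one job slice consisting of four identical tasks; srun starts four copies of the hostname command in parallel:

#!/bin/bash

#SBATCH --ntasks=4
#SBATCH --job-name=test_tasks
#SBATCH --output=myout.txt
#SBATCH --time=10:00

# One job slice with 4 tasks: srun starts 4 copies of 'hostname' in parallel.
srun hostname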

A task can have more than one CPU core allocated, so that the user application can spawn more processes and threads on its own. For example, to run an OpenMP multi-threaded job, or an application like R or Matlab which may start additional processes or threads itself, we request just one task to start the application, but with a suitable number of CPUs for this single task.
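As a hedged sketch of this pattern (the R module name and the script name are placeholders, not tested on the cluster), a single-task job for a multi-threaded R workload might look like the following; the shared memory example further below shows the same structure for OpenMP:

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --job-name=test_R
#SBATCH --output=myout.txt

# The module name is a placeholder; check 'module avail' for the actual R module.
module load R

# One task starts R; R itself (e.g. via the 'parallel' package) can use
# up to the CPUs allocated to this task.
Rscript my_analysis.R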

An example of a multi-task job is an MPI job. We just specify the number of tasks we want with ‘--ntasks’ and SLURM will start the parallel MPI processes for us. Those tasks may use more than one CPU per task if they are multi-threaded.

To have more job slices, we add more srun calls to the job script. For example, a job can consist of a data-producing slice, possibly with many parallel tasks, running in parallel with a data-collecting slice, again with many tasks, and with many CPUs per task if needed.
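A hedged sketch of such a multi-slice job is given below (the executable names are placeholders; depending on the Slurm version, the srun options --exclusive or --exact may additionally be needed so that concurrent slices do not share CPUs):

#!/bin/bash

#SBATCH --ntasks=8
#SBATCH --job-name=test_slices
#SBATCH --output=myout.txt

# Two job slices running in parallel within the same allocation:
# a 6-task producer and a 2-task collector (executable names are placeholders).
srun --ntasks=6 ./producer.exe &
srun --ntasks=2 ./collector.exe &

# Wait for both slices to finish before the job ends.
wait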

Distributed memory batch jobs (e.g. MPI)

This is an example of a 16-way MPI job.

#!/bin/bash

#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1
#SBATCH --nodes=1-1
#SBATCH --job-name=test_mpi
#SBATCH --output=myout.txt
#SBATCH --time=120:00
#SBATCH --mem-per-cpu=512

module load MPI/mpich/gcc/3.2.1
srun cd 2>/dev/null  # a workaround, needed when the node powers up
srun myMPIexecutable.exe

The above script requests 16 tasks for 16 MPI processes. Typically it is better to have all the processes running on the same node; this is requested with ‘--nodes=1-1’. In SLURM, the number of tasks is the number of instances of the command that are run, typically by a single srun command. In the above script we have 16 identical processes, hence this is a slice with 16 tasks and not just one task using 16 CPU cores.

The mpich library version loaded by the module command is built with SLURM support. The srun command takes care of creating and managing the MPI processes; it replaces the mpirun or mpiexec commands.

An example of an MPI program and job script can be found here:

/share/slurm_examples/mpi
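As a sketch of the typical workflow (the source and job script file names are placeholders), the MPI program is compiled with the compiler wrapper provided by the loaded MPI module and the job script is then submitted with sbatch:

# Load the same MPI module as in the job script, compile, then submit.
module load MPI/mpich/gcc/3.2.1
mpicc -o myMPIexecutable.exe my_mpi_program.c
sbatch mpi_job.sh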

As with serial jobs, an MPI process (task) gets access to a whole physical core when possible, and it is then counted as two CPUs in Slurm.

Shared memory batch jobs (e.g. OpenMP)

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --threads-per-core=1
#SBATCH --job-name=test_smp
#SBATCH --output=myout.txt
#SBATCH --time=120:00
#SBATCH --mem-per-cpu=512

export OMP_NUM_THREADS=16
./a.out

In the above script, one task is requested, with 16 CPU cores allocated to this task. The executable can use up to 16 CPU cores for its threads or processes. In a similar fashion, parallel Matlab jobs (not tested) can be launched (with the Matlab parallel toolbox, but without the Matlab parallel server), as can any other application that uses multiple CPUs and manages them on its own.
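A possible (untested) script for such a Matlab job is sketched below; the module name and the parpool call, which relies on the Matlab parallel toolbox, are assumptions that may need adapting to the local installation:

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --job-name=test_matlab
#SBATCH --output=myout.txt

# The module name is a placeholder; check 'module avail' for the local Matlab module.
module load matlab

# One task starts Matlab; parpool opens local workers on the CPUs
# allocated to this task (the script name is a placeholder).
matlab -nodisplay -r "parpool(str2double(getenv('SLURM_CPUS_PER_TASK'))); my_parallel_script; exit"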

Using ‘--cpus-per-task’ is a bit tricky because of hyperthreading. For Slurm, a CPU is a logical CPU (a hardware thread). On the RACC, Slurm is configured to always allocate a whole physical core to a task, but ‘--cpus-per-task’ still counts Slurm’s CPUs, i.e. on processors with hyperthreading these are logical CPUs (hardware threads). On most of our physical nodes hyperthreading is enabled and there are two logical CPUs per physical core. On the VM nodes the CPUs are hardware threads allocated by the hypervisor. If you are happy to count CPUs as hardware threads, that is easy and consistent in both cases. However, it is often better to run just one thread per physical core, and then some customization of the job, depending on the compute node’s capability, is needed.
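One possible way to handle this (a sketch, assuming lscpu is available on the compute node and that Slurm sets SLURM_CPUS_PER_TASK when ‘--cpus-per-task’ is requested) is to derive the thread count inside the job script from the node’s threads-per-core value:

#!/bin/bash

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --job-name=test_smp_ht
#SBATCH --output=myout.txt

# Slurm counts logical CPUs (hardware threads). To run one OpenMP thread per
# physical core, divide the allocation by the node's threads per core
# (2 with hyperthreading, 1 otherwise).
THREADS_PER_CORE=$(lscpu | awk '/^Thread\(s\) per core:/ {print $4}')
export OMP_NUM_THREADS=$(( SLURM_CPUS_PER_TASK / THREADS_PER_CORE ))

./a.out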

 
