Knowledge Base

Reading Academic Computing Cluster – Slurm commands and resource allocation policy

(work in progress)

Resources and limits

Partitions:

In SLURM, partitions are (possibly overlapping) groups of nodes. Partitions are similar to queues in some other batch systems, e.g. in SGE on met-cluster and maths-cluster. The default partition is called ‘cluster’; it has a default time limit of 24 hours and a maximum time limit of 30 days, although we do not recommend running jobs that long. There is also the ‘limited’ partition, with a maximum time limit of 24 hours; it allows access to some of the ‘project’ nodes.

partition                   limits                             description
cluster                     default 24 hours, maximum 30 days  default partition
limited                     maximum 24 hours                   allows access to some of the ‘project’ nodes
gpu                         –                                  –
gpu_limited                 –                                  –
project                     –                                  –
custom project partitions   –                                  –
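
As a minimal sketch (the partition names come from the table above, the flag syntax is standard Slurm, and the script name myjob.sh is a placeholder), a job can be directed to a particular partition either on the command line or with a directive inside the batch script:

$ sbatch --partition=limited myjob.sh

or, equivalently, in the script itself:

#SBATCH --partition=limited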

The cluster currently consists of over 50 compute nodes, with core counts ranging from 8 to 24 cores per node. There is also a node with 4 GPU devices.

Time and memory limits:

The default time limit in the ‘cluster’ partition is 24 hours and the default memory limit is 1 GB per CPU core. The maximum time limit is 30 days; there is no configured maximum memory limit, only the hardware capacity. Users are expected to estimate their CPU and memory requirements properly. Over-allocating resources prevents other users from accessing the unused memory and CPU time. In addition, previously consumed resources are used to compute the user’s fair-share priority factor, so over-provisioned jobs will have a negative effect on the priority of the user’s future jobs.

The CPU and memory limits are strictly enforced by the scheduler. Tasks are confined to the requested number of CPU cores, e.g. if you request one CPU core and run a parallel job, all your threads or processes will run on a single core. Processes which exceed their memory allocation will be killed by the scheduler.
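
As a hedged sketch of matching the request to actual needs (the values and the program name are illustrative; the #SBATCH directives are standard Slurm), a job that needs 4 cores, 2 GB of memory per core and about 6 hours could be requested as follows:

#!/bin/bash
#SBATCH --partition=cluster     # default partition
#SBATCH --time=06:00:00         # wall-clock limit; the default is 24 hours
#SBATCH --cpus-per-task=4       # number of CPU cores
#SBATCH --mem-per-cpu=2G        # memory per core; the default is 1 GB

./my_program                    # placeholder for the actual workload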

Fair share policies:

Jobs waiting to be scheduled are ordered according to a multifactor priority. The fair-share component is based on

  • the amount of resources currently allocated to the user, i.e. the CPUs and memory being used by the user’s running jobs,
  • the resources consumed by the user in the past,

and the job component depends on

  • the resource request of the job – short jobs and jobs with small memory requests will start faster,
  • the partition, and the (not yet implemented) quality of service (QOS) factor.
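
Depending on how accounting is configured on the cluster, the standard Slurm commands sprio and sshare can be used to inspect these factors; a minimal sketch:

$ sprio -l        # per-factor breakdown of the priority of pending jobs
$ sshare -U       # your fair-share usage and the resulting fair-share factor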

Power saving

Inactive nodes are automatically shut down. A ‘~’ in the node state shown by the ‘sinfo’ command means that the node is switched off. When you submit a job, such nodes are switched on automatically, and there is only a short delay before the job starts running, to allow the servers to boot. It takes 6 minutes to boot all 17 nodes, but much less time when only a few need to be started.

It should be noted that just after a node has powered up, MPI jobs may run into problems related to the automounter and sssd. A workaround is to add a dummy job step in the batch script, e.g. ‘cd’, which will fail, but it forces the home directory to be automounted so that the production task runs fine.
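
A minimal sketch of this workaround (the resource requests and the executable name mpi_prog are placeholders):

#!/bin/bash
#SBATCH --partition=cluster
#SBATCH --ntasks=32
#SBATCH --time=06:00:00

# Dummy step: touching the home directory forces the automounter
# to mount it before the real MPI launch (see the note above).
cd "$HOME"

# Production task; mpi_prog stands for your actual MPI executable.
srun ./mpi_prog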

 

SLURM commands

Overview:

Available cluster resources can be displayed with the command sinfo

Queued and running jobs can be displayed with the command squeue (the -l flag gives a more detailed listing)

Further, job accounting data can be obtained with the command sacct

Batch jobs are submitted using the command sbatch

The commands salloc and srun allow tasks to be run interactively on the compute nodes (this is not the kind of interactive session known to met-cluster users)

Jobs can be killed with scancel
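
As a quick illustration (job.sh and the job ID 12345 are placeholders), a typical cycle looks like this:

$ sbatch job.sh          # submit a batch job; prints the job ID
$ squeue -l -u $USER     # list your queued and running jobs
$ sacct -j 12345         # accounting data for the job
$ scancel 12345          # cancel the job if needed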

Monitoring cluster resources with ‘sinfo’:

As ‘cluster’ is the default partition, it is convenient to display the resources for just this partition by adding ‘-p cluster’ to the ‘sinfo’ command. By default, nodes that are in the same state are grouped together.

$ sinfo -p cluster

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cluster*     up   infinite      7  idle~ compute-0-[5-11]
cluster*     up   infinite      1    mix compute-0-0
cluster*     up   infinite      4  alloc compute-0-[1-4]

The above output shows that nodes 5-11 are idle, and ‘~’ means they are switched off to save power. Nodes 1-4 are fully allocated, meaning they will not be available for new jobs until the jobs currently running on them are finished. The mix state for the compute-0-0 node means that some of the cores on the node are in use and some of them are free.

Further details can be displayed using the ‘-o’ flag. See the manual page, ‘man sinfo’, for more details on format specifiers. In this example, the number of CPU cores is displayed with the command:

$ sinfo -p cluster -o "%P %.6t %C"
PARTITION  STATE CPUS(A/I/O/T)
cluster*  idle~ 0/112/0/112
cluster*    mix 8/8/0/16
cluster*  alloc 64/0/0/64

A/I/O/T stands for Allocated/Idle/Other/Total. The idle and switched off nodes have 112 cores available. There is a node with 8 cores allocated, and another 8 cores idle. In total, there are 120 cores (8 + 112) available for new jobs.

Nodes can be listed individually by adding the ‘-N’ flag:

$ sinfo -p cluster -N -o "%N %.6t %C"
NODELIST  STATE CPUS(A/I/O/T)
compute-0-0  alloc 16/0/0/16
compute-0-1  alloc 16/0/0/16
compute-0-2  alloc 16/0/0/16
compute-0-3  alloc 16/0/0/16
compute-0-4  idle~ 0/16/0/16
compute-0-5  idle~ 0/16/0/16
compute-0-6  idle~ 0/16/0/16
compute-0-7  idle~ 0/16/0/16
compute-0-8   mix  8/8/0/16
compute-0-9  idle~ 0/16/0/16
compute-0-10  idle~ 0/16/0/16
compute-0-11  idle~ 0/16/0/16

Monitoring jobs with ‘squeue’:

A more informative output format can be achieved with the following:

squeue -o "%.18i %.8u %.8a %.9P %q %.8j %.8T %.12M %.12l %.5C %.10R %p"

It might be a good idea to add this as an alias in your .bashrc, as shown below.
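
A minimal sketch of such an alias (the name ‘sq’ is an arbitrary choice):

alias sq='squeue -o "%.18i %.8u %.8a %.9P %q %.8j %.8T %.12M %.12l %.5C %.10R %p"'

After adding this line to ~/.bashrc, run ‘source ~/.bashrc’ or open a new shell, and the formatted listing is available as ‘sq’.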