Queuing a Job

Computations on the cluster are performed in so-called jobs. A job is a self-contained, user-defined task (or set of smaller tasks). As a user, you also specify how many resources (nodes, cores, RAM etc.) a job needs, and for how long. The job is then placed in a waiting queue, and a piece of software called a scheduler starts it as soon as enough resources are available.

The scheduler in use on the OMNI cluster is SLURM (Simple Linux Utility for Resource Management); the installed version is 20.02.5. SLURM is operated with a number of console commands for queuing jobs, monitoring them and, if necessary, cancelling them. The most important commands are introduced on this page; a separate list of commands is also available.

A job usually consists of a Linux command to be executed or a (shell) script with a set of commands.

The queues on the OMNI cluster have been configured by ZIMT and differ mostly in the maximum run time they allow. Within SLURM, the term “partition” is used instead of “queue”.

On this page you can find information about writing a job script, queuing a job and monitoring it, as well as tips on which job settings to use.

Waiting queues

The following queues are configured on the cluster (and can be displayed with the spartition command):

$ spartition

PartitionName   DefaultTime       MaxTime   MaxNodesPerJob
        debug      00:15:00      01:00:00                2
       expert    1-00:00:00    1-00:00:00              128
          gpu      12:00:00    1-00:00:00                2
          htc      12:00:00    5-00:00:00                2
      jupyter      12:00:00    1-00:00:00                1
         long    5-00:00:00   20-00:00:00                8
       medium      12:00:00    1-00:00:00               32
        short      01:00:00      02:00:00               32
          smp      12:00:00    5-00:00:00                1

Additional things to note:

- short is the default queue. Jobs for which no partition is explicitly specified are put into short.
- gpu is for jobs that need GPUs. The GPUs need to be requested specifically in the SLURM job settings, or the job will not use them.
- smp is the queue for the SMP nodes (SMP = Shared Multiprocessing), two larger nodes with 1536 GB RAM each, which are intended for particularly RAM-intensive computations.
- expert is a special queue which allows very large jobs. If you would like to use it, please contact us. We decide on access to the expert queue on a case-by-case basis.
- htc is the queue for the Moonshot nodes.
- jupyter is a queue reserved for Jupyter jobs.
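
As a rough sketch of how a partition and a GPU might be requested together, the job script header below selects the gpu partition and asks for one GPU. The --gres option is standard SLURM, but the resource name gpu, the count and the time limit are assumptions about the local configuration, so check the cluster documentation for the exact values. That SLURM sets CUDA_VISIBLE_DEVICES for GPU jobs is likewise an assumption here.

```shell
#!/bin/bash
#SBATCH --partition=gpu      # queue for GPU jobs
#SBATCH --gres=gpu:1         # assumption: GPUs are exposed as the generic resource "gpu"
#SBATCH --time=01:00:00      # must not exceed the MaxTime of the partition

# Falls back to "none" when no GPU was assigned (or when run outside of a job).
visible_gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Partition: ${SLURM_JOB_PARTITION:-unknown}, GPUs visible: $visible_gpus"
```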

Job scripts

The usual workflow when creating a job is as follows:

Workflow

1. You write a job script. The script is intended to do the following things:
   - Call your software
   - Define your job settings
   - Load any environment necessary
   - Other tasks that might be necessary for your job
2. You prepare your software (e.g. compile it) and your data (e.g. parameter files). If necessary, you allocate a workspace.
3. You queue your script into the right partition with sbatch.
4. You wait until your job starts.
5. As soon as your job has started, you should check if everything was set up correctly and that your software is running as intended.
6. You wait until your job has finished.

Tip: you can connect via ssh from the login node to the compute nodes that run your job and run top there to see if your software is running. Once your job has started, the squeue command shows which nodes it is running on. Users only have SSH access to compute nodes on which one of their own jobs is running.
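
The tip above could be turned into a small helper, sketched here. The function name job_nodes and the Job ID 12345 are made up for illustration; the squeue options used (-h, -j, -o with the %N field) are standard SLURM.

```shell
# Print the node list of a job (empty while the job is still waiting).
# -h suppresses the header, -j selects the job, -o "%N" prints only the node list.
job_nodes() {
    squeue -h -j "$1" -o "%N"
}

# Usage on the login node (12345 is a placeholder Job ID):
#   job_nodes 12345
#   ssh <node name>      # then run "top" on that node
```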

General considerations

- The fewer resources (tasks/nodes) you request and the shorter the requested runtime, the shorter your waiting time will be. You should, however, leave a generous reserve when selecting the time, because jobs are terminated immediately upon reaching the time limit.
- To keep the waiting time short, you should also put a job into the shortest possible queue (e.g. short when your runtime is below two hours).
- Due to unpredictable circumstances it is always possible that a job aborts early. If possible, you should save intermediate results to the hard drive regularly (so-called checkpointing). Depending on your software, it might be possible to restart the calculation from a checkpoint instead of re-running it from the beginning.
- Every SLURM job gets a consecutive number called the Job ID. It is displayed when queuing the job and serves as a unique identifier for it.
- SLURM can also inform you via e-mail about the status of your job.
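
The following is a minimal sketch of how e-mail notification and a checkpoint restart could look in a job script. The options --mail-type and --mail-user are standard SLURM; the e-mail address, the file name checkpoint.dat, the program name my_simulation and its --mode option are placeholders invented for this example, since the actual restart mechanism depends entirely on your software.

```shell
#!/bin/bash
#SBATCH --time=02:00:00                    # includes a reserve on top of the expected runtime
#SBATCH --mail-type=BEGIN,END,FAIL         # send e-mail when the job starts, ends or fails
#SBATCH --mail-user=jane.doe@example.org   # placeholder address

checkpoint_file="checkpoint.dat"           # hypothetical file written by your software

# Resume from the checkpoint if a previous run left one behind.
if [ -f "$checkpoint_file" ]; then
    run_mode="restart"
else
    run_mode="fresh"
fi
echo "Starting in $run_mode mode"

# Hypothetical program; it is only started if it exists in the current directory.
if [ -x ./my_simulation ]; then
    ./my_simulation --mode "$run_mode"
fi
```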

Creating a job script

Here is an example of a job script that is used to run the finite-element application Abaqus:

#!/bin/bash
#SBATCH --time=0:20:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --mem 128000
#SBATCH --partition=short

module load abaqus

echo "Number of tasks: "
echo $SLURM_NTASKS

abaqus job=Test.inp mp_mode=mpi interactive cpus=$SLURM_NTASKS

The meaning of the individual commands is:

- #!/bin/bash means that the script is to be run with the Bash shell. The interpreter does not have to be a shell: a Python script, for example, may be run with #!/usr/bin/python.
- The options beginning with #SBATCH are instructions to SLURM about which job settings to use. In this case, a full node is allocated and 16 tasks (parallel processes) are started on it. Most of these options have default values and do not necessarily need to be specified. Possible options can be found on our website (see SLURM) and in the SLURM documentation.
- Next, some additional operations are performed. In this case a necessary module is loaded, but in principle these can be any Linux commands (e.g. copying files, creating directories).
- For demonstration purposes, the value of the environment variable SLURM_NTASKS is printed. SLURM sets some environment variables for each job, which provide information about the job to the programs within it. In this case, SLURM_NTASKS=16 as per the job settings.
- Finally, the application is called, Abaqus in this case. Note how the environment variable is used within the options for Abaqus, so that Abaqus is run with the correct number of parallel processes.
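
Besides SLURM_NTASKS, SLURM sets further environment variables for each job. The sketch below prints a few commonly available ones; the fallback values after ":-" are an addition for this example so that the script can also be tried outside of a job.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16

# The defaults after ":-" are only used when the script runs outside of SLURM.
job_id="${SLURM_JOB_ID:-none}"
node_list="${SLURM_JOB_NODELIST:-localhost}"
submit_dir="${SLURM_SUBMIT_DIR:-$PWD}"
ntasks="${SLURM_NTASKS:-1}"

echo "Job ID:           $job_id"
echo "Allocated nodes:  $node_list"
echo "Submitted from:   $submit_dir"
echo "Number of tasks:  $ntasks"
```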

Queuing a job

A job script is queued with the command sbatch, for example:

$ sbatch --time 0:30:00 example.sh

The script example.sh does not need to be executable. The sbatch command accepts a number of options (command line arguments) which, when specified, will supersede the ones given in the script. In this example, the script would not be run with a runtime of 20 minutes as specified in the script above, but with a runtime of 30 minutes. A complete list of options can be shown with sbatch -h and is also described in the SLURM documentation.
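
When queuing jobs from another script, it can be handy to keep the Job ID for later monitoring. The sketch below assumes the usual sbatch confirmation line of the form "Submitted batch job <Job ID>"; the function name extract_job_id is made up for this example.

```shell
# Print the last word of the sbatch confirmation line, which is the Job ID.
extract_job_id() {
    printf '%s\n' "${1##* }"
}

# Only try to queue the job where sbatch is actually available.
if command -v sbatch >/dev/null 2>&1; then
    job_id=$(extract_job_id "$(sbatch --time 0:30:00 example.sh)")
    echo "Queued job $job_id"
fi
```

Alternatively, sbatch --parsable prints only the Job ID, which avoids the parsing step.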

Interactive jobs

Sometimes an application needs user input while it is running, but also requires so many resources that it has to run on the cluster. For this purpose you can create interactive jobs. In an interactive job a console stays open and you can run any application within the job.

Caution: It is forbidden to run CPU-heavy applications on the login nodes because they affect all users negatively. The HPC team reserves the right to terminate such applications without prior warning.

An interactive job can be started with the following command:

$ srun --pty /bin/bash

The srun command queues a SLURM job to run a single command (by comparison, sbatch runs a single script). In this example, a Bash shell is started. The option --pty (“pseudo terminal”) makes the job interactive. The srun command accepts most of the same options about number of nodes, partition and runtime as sbatch. All options can be displayed with srun --help.

Tips:

- It is not necessary to run a shell here; srun may run any Linux command.
- When an interactive job is queued, the console is blocked until the job starts. If you want to continue working on the cluster in the meantime, the easiest option is to open a second console with an SSH connection to the cluster.
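
Since srun accepts most of the same options as sbatch, an interactive job with explicit settings could be wrapped in a small shell function, for example in your ~/.bashrc. This is only a sketch: the function name start_interactive and the default values (partition short, 30 minutes, one task) are illustrative choices.

```shell
# Start an interactive Bash shell on a compute node.
# Optional arguments: $1 = partition (default: short), $2 = time limit (default: 0:30:00).
start_interactive() {
    srun --partition="${1:-short}" --time="${2:-0:30:00}" --ntasks=1 --pty /bin/bash
}

# Usage:
#   start_interactive                 # 30 minutes in short
#   start_interactive medium 4:00:00  # 4 hours in medium
```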

Job monitoring and cancelling

SLURM offers a number of commands for monitoring jobs. Only the most important ones are described here; a more extensive list is available separately. Usually, the options for each command can be displayed with --help.

You can see which jobs are queued by typing squeue. The output can be narrowed down further: squeue -p <queue name> lists only a specific partition (queue), and squeue -u <user name> shows only a specific user’s jobs.

The status of the cluster nodes (idle, busy, etc.) can be shown with sinfo, information about the configuration of the queues with spartition. More detailed information for an individual job will be displayed when entering scontrol show job <Job ID>.

A queued job can be cancelled with scancel <Job ID>, independently of whether it already started running or not. All your own jobs can be cancelled with scancel -u <Your username>.
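
The output of scontrol show job is fairly long. As a sketch, the state of a job can be filtered out of it with sed, assuming the output contains a field of the form JobState=<STATE>, as in the sample line below. The function name job_state and the Job ID 12345 are made up for this example.

```shell
# Read "scontrol show job" output on stdin and print the value of the JobState field.
job_state() {
    sed -n 's/.*JobState=\([A-Z_]*\).*/\1/p'
}

# Real usage:  scontrol show job 12345 | job_state
# Demonstration with a sample line instead of real scontrol output:
sample="   JobState=RUNNING Reason=None Dependency=(null)"
printf '%s\n' "$sample" | job_state    # prints RUNNING
```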

Tips

You have considerable control over which resources you request, and it is advisable to think about this ahead of time.

For example, if your application does not run in parallel but needs the entire RAM of a node, you can reserve a complete node.
```
$ sbatch --nodes 1 <scriptname>
```
Note that you could also use -N which is a shorthand for --nodes.
When starting multiple processes, you also have to specify multiple tasks. For MPI programs, for example, one task equals one process, and a program with 24 MPI processes can be queued with
```
$ sbatch --ntasks 24 <scriptname>
```
and within the job script mpirun -np 24 needs to be specified. The option --ntasks has the shorthand -n. Alternatively, the number of nodes and tasks can be specified:
```
$ sbatch -N 2 --ntasks-per-node 12 <scriptname>
```
In this example, as in the previous one, 24 tasks would be started. This time, however, it is guaranteed that the tasks are distributed over exactly 2 nodes, with 12 tasks on each. You cannot start more tasks on a node than it has CPUs, and SLURM will display an error message when you queue such a job.
When a program with multiple threads is to be run, for example when OpenMP is used, only one task is used for the program. However, one task can also be allocated multiple CPUs.
```
$ sbatch --nodes 1 --cpus-per-task 16 <scriptname>
```
In this example, a complete node would be used for a single OpenMP program.
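Inside the job script, the OpenMP program still needs to be told how many threads to use. A common way, sketched below, is to set OMP_NUM_THREADS from the variable SLURM_CPUS_PER_TASK, which SLURM sets when --cpus-per-task is specified. The program name my_openmp_program is a placeholder.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is given; otherwise use 1 thread.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $OMP_NUM_THREADS OpenMP thread(s)"

# Hypothetical program name; it is only started if it exists in the current directory.
if [ -x ./my_openmp_program ]; then
    ./my_openmp_program
fi
```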
When you need to run a program multiple times in parallel within a job, you can do that with srun. Within job scripts, srun works differently than when you enter it in the console: no new job is queued, but the command is run multiple times within the existing job. When you have an MPI program, however, mpirun is sufficient and no additional srun is needed in front of it.
As mentioned above, SLURM creates environment variables that contain job information. For example, $SLURM_NTASKS contains the number of tasks available to the current job. An MPI program can then be called in the job script with
```
$ mpirun -np $SLURM_NTASKS <program name>
```
and it is run with the correct number of MPI processes.

Caution: the variable SLURM_NTASKS is only set when the number of tasks is explicitly specified (e.g. via --ntasks-per-node or --ntasks), either in the job script or when queuing the job.
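
To guard against an unset SLURM_NTASKS, a job script could fall back to a default value, as in the following sketch. The fallback of one process and the program name my_mpi_program are assumptions made for this example.

```shell
#!/bin/bash
#SBATCH --ntasks=24

# Warn if the number of tasks was not specified anywhere.
if [ -z "${SLURM_NTASKS:-}" ]; then
    echo "SLURM_NTASKS is not set; falling back to 1 process" >&2
fi
ntasks="${SLURM_NTASKS:-1}"

# Hypothetical program name; print the command instead if it cannot be run here.
if [ -x ./my_mpi_program ] && command -v mpirun >/dev/null 2>&1; then
    mpirun -np "$ntasks" ./my_mpi_program
else
    echo "Would run: mpirun -np $ntasks ./my_mpi_program"
fi
```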

Updated at 15:23 on 8 February 2021 by Gerd Pokorra
