Computations on the cluster are performed in so-called jobs. A job is a self-contained, user-defined task (or set of smaller tasks). As a user, you also specify how many resources (nodes, cores, RAM etc.) a job needs, and for how long. The job is then put into a waiting queue, and a program called a scheduler starts it as soon as enough resources are available.
The scheduler in use on the OMNI cluster is called SLURM (Simple Linux Utility for Resource Management); the installed version is 20.02.5. SLURM is operated with a number of console commands for queuing jobs, monitoring them and, if necessary, canceling them. The most important commands are introduced on this page and are also listed separately.
A job usually consists of a Linux command to be executed or a (shell) script with a set of commands.
The queues on the OMNI cluster have been configured by ZIMT and differ mostly in the maximum run time they allow. Within SLURM, the term “partition” is used instead of “queue”.
The following queues are configured on the cluster:
    PartitionName  DefaultTime  MaxTime
    debug          0:15:00      01:00:00
    short*         01:00:00     02:00:00
    medium         12:00:00     1-00:00:00
    long           5-00:00:00   20-00:00:00
    gpu            12:00:00     1-00:00:00
    smp            12:00:00     5-00:00:00
    expert         1-00:00:00   1-00:00:00
    htc            12:00:00     5-00:00:00
They can also be displayed with the command spartition, which was added by ZIMT. Additional things to note:
- short is the default queue. Jobs for which no partition is explicitly specified will be put into short.
- gpu is for jobs that need GPUs. The GPUs need to be requested specifically in the SLURM job settings, or the job will not use them.
- smp is the queue for the SMP nodes (SMP = symmetric multiprocessing, i.e. shared-memory nodes), two larger nodes with 1536 GB of RAM each, which are intended for particularly RAM-intensive computations.
- expert is a special queue which allows very large jobs. If you would like to use it, please contact us. We decide access for the expert queue on a case-by-case basis.
- htc is the queue for the Moonshot nodes.
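As a rough sketch of how the MaxTime column can guide the choice of partition, the following Bash functions convert a SLURM time string to seconds and pick the first of short, medium and long whose limit covers a given runtime. The function names to_seconds and pick_partition are made up for this illustration, only the H:MM:SS and D-HH:MM:SS formats used in the table are handled, and the special-purpose partitions are deliberately left out.

```shell
#!/bin/bash
# Convert a SLURM time string (H:MM:SS or D-HH:MM:SS) to seconds.
to_seconds() {
    local t=$1 days=0 h m s
    if [[ $t == *-* ]]; then
        days=${t%%-*}   # part before the dash: days
        t=${t#*-}       # part after the dash: HH:MM:SS
    fi
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so that values like "08" are not read as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Print the first general-purpose partition whose MaxTime covers the runtime.
# The limits below are copied from the partition table; adjust if they change.
pick_partition() {
    local need name max
    need=$(to_seconds "$1")
    while read -r name max; do
        if (( need <= $(to_seconds "$max") )); then
            echo "$name"
            return 0
        fi
    done <<'EOF'
short 02:00:00
medium 1-00:00:00
long 20-00:00:00
EOF
    echo "no partition allows a runtime of $1" >&2
    return 1
}

pick_partition 1:30:00      # prints: short
pick_partition 0-18:00:00   # prints: medium
```

The result could then be passed on when queuing, for example via sbatch --partition. Remember to include a time reserve in the runtime you pass in.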
The usual workflow when creating a job is as follows:
- You write a job script. The script is intended to do the following things:
  - Call your software
  - Define your job settings
  - Load any necessary environment
  - Perform other tasks that might be necessary for your job
- You prepare your software (e.g. compile it) and your data (e.g. parameter files). If necessary, you allocate a workspace.
- You queue your script into the right partition with sbatch.
- You wait until your job starts.
- As soon as your job has started, you check that everything was set up correctly and that your software is running as intended.
- You wait until your job has finished.
Tip: you can connect to the compute nodes that run your job via ssh from the login node and run top to see if your software is running. The squeue command will show on which nodes your job is running after it has started. Users only have SSH access to compute nodes on which one of their own jobs is running.
- The fewer resources (tasks/nodes) requested and the shorter the runtime required, the shorter your waiting time will be. You should however leave a generous reserve when selecting the time because jobs will be terminated immediately upon reaching the time limit.
- For the same reason you should put a job into the shortest possible queue (e.g. short when your runtime is below two hours).
- Due to unpredictable circumstances it is always possible that a job may abort early. If possible, you should save intermediate results to the hard drive regularly (so-called checkpointing). Depending on your software, it might be possible to restart the calculation at a checkpoint instead of re-running it from the beginning.
- Every SLURM job gets a consecutive number called the Job ID. It will be displayed when queuing the job and serves as a unique identifier for it.
- SLURM also has the capability to inform you via e-mail about the status of your job.
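The checkpointing idea from the notes above can be sketched in plain Bash. This is only an illustration under assumptions: the file name checkpoint.txt, the step count and the function run_step are placeholders, and real applications usually provide their own restart mechanism, which should be preferred where available.

```shell
#!/bin/bash
# Placeholder names: checkpoint.txt, TOTAL_STEPS and run_step are examples.
CHECKPOINT=${CHECKPOINT:-checkpoint.txt}
TOTAL_STEPS=${TOTAL_STEPS:-5}

run_step() {
    # Stand-in for one unit of real work.
    echo "computing step $1"
}

run_with_checkpoints() {
    local last=0 step
    # Resume after the last completed step if a checkpoint exists.
    if [[ -f $CHECKPOINT ]]; then
        last=$(< "$CHECKPOINT")
    fi
    for (( step = last + 1; step <= TOTAL_STEPS; step++ )); do
        run_step "$step"
        # Write to a temporary file and rename it afterwards, so that an
        # abort during the write cannot leave a half-written checkpoint.
        echo "$step" > "$CHECKPOINT.tmp"
        mv "$CHECKPOINT.tmp" "$CHECKPOINT"
    done
}

run_with_checkpoints
```

If the job is terminated after step 3, queuing the same script again continues with step 4 instead of starting over.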
Creating a job script
Here is an example of a job script that is used to run the finite-element application Abaqus:
    #!/bin/bash
    #SBATCH --time=0:20:00
    #SBATCH --nodes=1
    #SBATCH --tasks-per-node=16
    #SBATCH --mem 128000
    #SBATCH --partition=short
    module load abaqus
    echo "Number of tasks: "
    echo $SLURM_NTASKS
    abaqus job=Test.inp mp_mode=mpi interactive cpus=$SLURM_NTASKS
The meaning of the individual commands is:
- #!/bin/bash means that the script is to be run with the Bash shell. The interpreter does not have to be a shell: a Python script, for example, may be run with a first line that names a Python interpreter instead.
- The options beginning with #SBATCH are instructions to SLURM about which job settings to use. In this case, a full node is allocated and 16 tasks (parallel processes) are started on it. Most of these options have default values and do not necessarily need to be specified. Possible options are described elsewhere in this documentation and in the SLURM documentation.
- Next, some additional operations are performed. In this case a necessary module is loaded, but in principle these can be any Linux commands (e.g. copying files, creating directories).
- For demonstration purposes, the value of the environment variable SLURM_NTASKS is output. SLURM sets some environment variables for each job which provide information about the job to programs within it. In this case, SLURM_NTASKS=16 as per the job settings.
- Finally, the application is called, Abaqus in this case. Note how the environment variable is used within the options for Abaqus, so that Abaqus is run with the correct number of parallel processes.
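To illustrate the environment variables mentioned above, here is a small sketch of a job script that prints a few of them. The variables are standard SLURM ones; the function name describe_job and the chosen job settings are only examples, and the fallbacks exist so that the script can also be tried outside of a job.

```shell
#!/bin/bash
#SBATCH --time=0:05:00
#SBATCH --ntasks=4

# Print a few job properties from SLURM's environment variables.
# The ":-" fallbacks only matter when the script runs outside of a job.
describe_job() {
    echo "Job ID:          ${SLURM_JOB_ID:-not set}"
    echo "Node list:       ${SLURM_JOB_NODELIST:-not set}"
    echo "Number of tasks: ${SLURM_NTASKS:-not set}"
    echo "Submit dir:      ${SLURM_SUBMIT_DIR:-not set}"
}

describe_job
```

Printing these values at the top of a job's output can make it easier to relate a log file to a particular job later on.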
A job script is queued with the command sbatch, for example:
$ sbatch --time 0:30:00 example.sh
example.sh does not need to be executable. The sbatch command accepts a number of options (command line arguments) which, when specified, will supersede the ones given in the script. In this example, the script would not be run with a runtime of 20 minutes as specified in the script above, but with a runtime of 30 minutes. A complete list of options can be shown with sbatch -h and is also described in the SLURM documentation.
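When queuing from within another script, it can be handy to capture the Job ID for later use. The sketch below relies on the sbatch option --parsable, which prints only the job ID (optionally followed by a semicolon and a cluster name). The wrapper name submit is made up for this example, and example.sh stands for your own job script.

```shell
#!/bin/bash
# Queue a job and print only its numeric job ID.
# All arguments are passed through to sbatch unchanged.
submit() {
    local out
    out=$(sbatch --parsable "$@") || return 1
    # --parsable may print "jobid;clustername"; keep only the job ID.
    echo "${out%%;*}"
}

# Only try to submit where sbatch is available, i.e. on the cluster.
if command -v sbatch > /dev/null; then
    job_id=$(submit --time 0:30:00 example.sh) && echo "Queued as job $job_id"
fi
```

The captured ID can then be handed to the monitoring commands, for example scontrol show job or scancel.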
Sometimes an application needs user input while it is running, but also requires so many resources that it has to run on the cluster. Interactive jobs exist for this purpose: in an interactive job, a console stays open and you can run any application within the job.
Caution: It is forbidden to run CPU-heavy applications on the login nodes because they affect all users negatively. The HPC team reserves the right to terminate such applications without prior warning.
An interactive job can be started with the following command:
$ srun --pty /bin/bash
The srun command queues a SLURM job to run a single command (by comparison, sbatch runs a single script). In this example, a Bash shell is started. The option --pty (“pseudo terminal”) makes the job interactive. The srun command accepts most of the same options for the number of nodes, partition and runtime as sbatch. All options can be displayed with srun -h.
- It is not necessary to run a shell here; srun may run any Linux command.
- If an interactive job is queued, the console will be blocked until the job starts. If you want to continue working on the cluster in the meantime, the easiest option is to open a second console with an SSH connection to the cluster.
SLURM offers a number of commands for monitoring jobs. Only the most important ones are listed here; a more extensive list is available separately. Usually, the options for each command can be displayed with <command> --help.
You can see which jobs are queued by typing squeue. The output can be narrowed down further: squeue -p <queue name> lists only a specific partition (queue), and squeue -u <user name> shows only a specific user’s jobs.
The status of the cluster nodes (idle, busy, etc.) can be shown with sinfo, and information about the configuration of the queues with spartition. More detailed information about an individual job is displayed when entering scontrol show job <Job ID>.
A queued job can be cancelled with scancel <Job ID>, regardless of whether it has already started running. All your own jobs can be cancelled with scancel -u <Your username>.
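The monitoring commands can also be combined in small helper scripts. As one possible sketch, the function below polls squeue until a job no longer appears in the queue. The name wait_for_job and the default polling interval are assumptions made for this example; squeue -h (no header) and -j (job list) are standard options.

```shell
#!/bin/bash
# Block until the given job no longer appears in the queue.
# Arguments: job ID, polling interval in seconds (default: 60).
wait_for_job() {
    local job_id=$1 interval=${2:-60}
    # -h suppresses the header line, so any remaining output means the
    # job is still pending or running. Errors for job IDs that SLURM has
    # already forgotten are discarded and count as "no longer queued".
    while [[ -n $(squeue -h -j "$job_id" 2> /dev/null) ]]; do
        sleep "$interval"
    done
    echo "Job $job_id has left the queue"
}
```

A call such as wait_for_job 123456 30 (with a made-up job ID) would check every 30 seconds. Please keep the interval reasonably long so that the scheduler is not queried more often than necessary.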
You have considerable control over which resources you request, and it is advisable to think about this ahead of time.
For example, if your application does not run in parallel but needs the entire RAM of a node, you can reserve a complete node.
$ sbatch --nodes 1 <scriptname>
Note that you could also use -N, which is a shorthand for --nodes.
When starting multiple processes, you also have to specify multiple tasks. For MPI programs, for example, one task equals one process, and a program with 24 MPI processes can be queued with
$ sbatch --ntasks 24 <scriptname>
and within the job script mpirun -np 24 needs to be specified. The option --ntasks has the shorthand -n. Alternatively, the number of nodes and the number of tasks per node can be specified:
$ sbatch -N 2 --tasks-per-node 12 <scriptname>
In this example, as in the previous one, 24 tasks would be started, but now it is guaranteed that they are spread over exactly 2 nodes with 12 tasks each. You cannot start more tasks on a node than it has CPUs; SLURM will display an error message when such a job is queued.
When a program with multiple threads is to be run, for example when OpenMP is used, only one task is used for the program. However, a single task can also be allocated multiple CPUs:
$ sbatch --nodes 1 --cpus-per-task 16 <scriptname>
In this example, a complete node would be used for a single OpenMP program.
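Inside the job script, the OpenMP program still has to be told how many threads to use. A common pattern, sketched here under assumptions, is to derive OMP_NUM_THREADS from the SLURM variable SLURM_CPUS_PER_TASK. The function name set_omp_threads and the executable name my_openmp_program are hypothetical.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Use as many OpenMP threads as CPUs were allocated to the task.
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was given,
# hence the fallback to a single thread.
set_omp_threads() {
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
}

set_omp_threads
echo "Running with $OMP_NUM_THREADS OpenMP threads"

# my_openmp_program is a placeholder for your own executable.
if [[ -x ./my_openmp_program ]]; then
    ./my_openmp_program
fi
```

Deriving the thread count from the job settings avoids having to keep two numbers in sync when the allocation is changed later.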
When you need to run a program multiple times in parallel within a job, you can do that with srun. Within job scripts, srun works differently than when it is entered in the console: no new job is queued, but the command is run multiple times within the existing job. When you have an MPI program, however, mpirun is sufficient and no additional srun is needed in front of it. As mentioned above, SLURM creates environment variables that contain job information. For example, $SLURM_NTASKS contains the number of tasks available to the current job. An MPI program can then be called in the job script with
$ mpirun -np $SLURM_NTASKS <program name>
and it is run with the correct number of MPI processes.
Caution: the variable SLURM_NTASKS is only set when the number of tasks is explicitly specified (e.g. via --ntasks), either in the job script or when queueing the job.
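One way to guard against an unset SLURM_NTASKS is to check it before starting the program. The following is a minimal sketch; the function name require_ntasks and the executable name my_mpi_program are placeholders.

```shell
#!/bin/bash
#SBATCH --ntasks=24

# Print the task count, or fail with a message when SLURM did not set it.
require_ntasks() {
    if [[ -z ${SLURM_NTASKS:-} ]]; then
        echo "SLURM_NTASKS is not set; specify --ntasks when queuing" >&2
        return 1
    fi
    echo "$SLURM_NTASKS"
}

# my_mpi_program is a placeholder for your own executable.
# mpirun is only called when the task count is known and mpirun exists.
if ntasks=$(require_ntasks) && command -v mpirun > /dev/null; then
    mpirun -np "$ntasks" ./my_mpi_program
fi
```

Failing early with a clear message is usually easier to diagnose than an mpirun call with an empty -np argument.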