Queuing a Job

Computations on the cluster are performed in so-called jobs. A job is a self-contained, user-defined task (or set of smaller tasks). As a user, you also specify how many resources (nodes, cores, RAM etc.) a job needs, and for how long. This job is then put into a waiting queue, and a software called a scheduler starts it as soon as enough resources are available.

The scheduler in use on the HoRUS cluster is called SLURM (Simple Linux Utility for Resource Management); the installed version is 17.02.2. SLURM is operated through a number of console commands for queuing jobs, monitoring them and, if necessary, cancelling them. The most important commands are introduced here and also listed separately.

A job usually consists of a Linux command to be executed or a (shell) script containing a set of commands.

The queues on the HoRUS cluster have been configured by ZIMT and mostly differ by the allowed maximum run time. Within SLURM, the term “partition” is used instead of “queue”.

On this page you can find information about writing a job script, queuing a job and monitoring it, as well as tips on which job settings to use.

Waiting queues

The following queues are configured on the cluster:

PartitionName   DefaultTime       MaxTime
        short      01:00:00      02:00:00
       medium      12:00:00    1-00:00:00
         defq    2-00:00:00    4-00:00:00
         long    5-00:00:00   20-00:00:00

      medium2      12:00:00    1-00:00:00
          smp    5-00:00:00   20-00:00:00
          htc    1-00:00:00    1-00:00:00

They can also be displayed with the command spartition, which was added by ZIMT. Additional things to note:

  • defq is the default queue. Jobs where no partition is explicitly specified will be put into defq (see the example after this list for how to select a partition explicitly).
  • medium2 is a partition of 20 nodes with more RAM and more cores (see also here) which have been financed by the Lehrstuhl für Strömungstechnik und Strömungsmaschinen (Department of Mechanical Engineering). Members of that chair have priority but the queue may be used by anyone.
  • smp is the queue for the smp1 node, a single node with 512 GB RAM which is intended for particularly RAM-intensive computations.
  • htc is the queue for the Moonshot nodes.
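
A partition is selected with the --partition option (short form: -p), either inside the job script via an #SBATCH line or on the sbatch command line; both are explained further down this page. A minimal sketch (jobscript.sh is just a placeholder name):

#SBATCH --partition=medium

or, when queuing the job:

$ sbatch --partition=medium jobscript.sh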

Job scripts

The usual workflow when creating a job is as follows:

Workflow

  1. You write a job script. The script is intended to do the following things:
    • Call your software
    • Define your job settings
    • Load any environment necessary
    • Perform any other tasks your job might need
  2. You prepare your software (e.g. compile it) and your data (e.g. parameter files). If necessary, you allocate a workspace.
  3. You queue your script into the right partition with sbatch.
  4. You wait until your job starts.
  5. As soon as your job has started, you should check that everything was set up correctly and that your software is running as intended.
    Tip: you can connect to the compute nodes that run your job via ssh from the login node and run top to see if your software is running. The squeue command will show on which nodes your job is running once it has started (see the sketch after this list).
  6. You wait until your job has finished.
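
As a rough sketch of steps 3 to 5 (the script name and the node name are only placeholders):

$ sbatch jobscript.sh          # step 3: queue the job script
$ squeue -u $USER              # steps 4 and 5: check whether the job has started and on which node(s)
$ ssh <node name>              # step 5: log on to a compute node that runs your job
$ top                          # check that your software is actually running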

General considerations

  • The fewer resources you request and the shorter the runtime, the shorter your waiting time will be. You should, however, leave a generous reserve when selecting the time, because jobs are terminated immediately upon reaching the time limit.
  • For the same reason you should put a job into the shortest possible queue (e.g. short when your runtime is below two hours).
  • Due to unpredictable circumstances it is always possible that a job may abort early. If possible, you should save intermediate results to the hard drive regularly (so-called checkpointing). Depending on your software, it might be possible to restart the calculation at a checkpoint instead of re-running it from the beginning.
  • Every SLURM job gets a consecutive number called the Job ID. It will be displayed when queuing the job and serves as a unique identifier for it.
  • SLURM also has the capability to inform you via e-mail about the status of your job (see the example after this list).
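
For example, e-mail notifications can be requested with the standard SLURM options --mail-type and --mail-user in the job script (a minimal sketch; replace the address with your own):

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=first.last@example.org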

Creating a job script

Here is an example of a job script that is used to run the finite-element application Abaqus:

#!/bin/bash
#SBATCH --time=0:20:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=6
#SBATCH --mem=48000
#SBATCH --partition=short
module load abaqus/2017

echo "Number of tasks: " echo $SLURM_NTASKS

abq2017hf9 job=Test.inp mp_mode=mpi interactive cpus=$SLURM_NTASKS

The meaning of the individual commands is:

  • #!/bin/bash means that the script is to be run with the Bash shell. The script does not have to be a shell script; a Python script, for example, may be run with #!/usr/bin/python.
  • The options beginning with #SBATCH are instructions to SLURM about which job settings to use. In this case, a full node is allocated and 6 tasks (parallel processes) are started on it. Most of these options have default values and do not necessarily need to be specified. Possible options can be found here and in the SLURM documentation.
  • Next, some additional operations are performed. In this case a necessary module is loaded, but in principle these can be any Linux commands (e.g. copying files, creating directories).
  • For demonstration purposes the value of the environment variable SLURM_NTASKS is output. SLURM sets some environment variables for each job which provide information about the job to programs within it. In this case, SLURM_NTASKS=6 as per the job settings (a few more of these variables are shown after this list).
  • Finally, the application is called, Abaqus in this case. Note how the environment variable is used within the options for Abaqus, so that Abaqus is run with the correct number of parallel processes.
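
A few of the more common SLURM environment variables can be inspected from within the job script, for example (a sketch; which variables are set depends on the job settings, see also the caution further down this page):

echo "Job ID:     $SLURM_JOB_ID"
echo "Node list:  $SLURM_JOB_NODELIST"
echo "Tasks:      $SLURM_NTASKS"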

Queuing a job

A job script is queued with the command sbatch, for example:

$ sbatch --time 0:30:00 example.sh

The script example.sh does not need to be executable. The sbatch command accepts a number of options (command line arguments) which, when specified, supersede the ones given in the script. In this example, the job would therefore run with a time limit of 30 minutes rather than the 20 minutes specified in the script. A complete list of options can be shown with sbatch -h and is also described in the SLURM documentation.
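
If the job is accepted, sbatch confirms it with a line containing the Job ID mentioned above, roughly like this (the number here is only an illustration):

Submitted batch job 123456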

Interactive jobs

Sometimes an application may need user input while it is running, but also require so many resources that it has to run on the cluster. For this purpose, interactive jobs can be created. In an interactive job a console stays open and you can run any application within the job.

Caution: it is forbidden to run CPU-heavy applications on the login nodes because they affect all users negatively. The HPC team reserves the right to terminate such applications without prior warning.

An interactive job can be started with the following command:

$ srun --pty /bin/bash

The srun command queues a SLURM job to run a single command (by comparison, sbatch runs a single script). In this example, a Bash shell is started. The option --pty (“pseudo terminal”) makes the job interactive. The srun command accepts most of the same options about number of nodes, partition and runtime as sbatch. All options can be displayed with srun --help.
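
For example, an interactive shell with 4 tasks on one node in the short partition for at most one hour could be requested like this (a sketch; adjust the values to your needs):

$ srun --partition=short --nodes=1 --ntasks=4 --time=1:00:00 --pty /bin/bash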

Tips:

  • It is not necessary to run a shell here; srun may run any Linux command.
  • When an interactive job is queued, the console will block until the job starts. If you want to continue working on the cluster in the meantime, the easiest option is to open a second console with an SSH connection to the cluster.

Job monitoring and cancelling

SLURM offers a number of commands for monitoring jobs. Only the most important ones are listed here; a more extensive list is available separately. Usually, the options for each command can be displayed with --help.

You can see which jobs are queued by typing squeue. This can be refined further: with squeue -p <queue name> only a specific partition (queue) is listed, and with squeue -u <user name> only a specific user’s jobs are shown.

The status of the cluster nodes (idle, busy, etc.) can be shown with sinfo, information about the configuration of the queues with spartition. More detailed information for an individual job will be displayed when entering scontrol show job <Job ID>.

A queued job can be cancelled with scancel <Job ID>, independently of whether it already started running or not. All your own jobs can be cancelled with scancel -u <Your username>.
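
A typical monitoring sequence might look like this (a sketch; <Job ID> is the number reported by sbatch):

$ squeue -u $USER              # list your own waiting and running jobs
$ scontrol show job <Job ID>   # detailed information about one of them
$ scancel <Job ID>             # cancel that job if something went wrong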

Tips

You have quite a bit of control over which resources you request, and it is advisable to think about this ahead of time.

  • For example, if your application does not run in parallel but needs the entire RAM of a node, you can reserve a complete node.
    $ sbatch --nodes 1 <Scriptname>

    Note that you could also use -N which is a shorthand for --nodes.

  • When starting multiple processes, you also have to specify multiple tasks. For example for MPI programs one task equals one process and a program with 24 MPI processes can be queued with
    $ sbatch --ntasks 24 <Scriptname>

    and within the job script mpirun -np 24 needs to be specified. The option --ntasks has the shorthand -n. Alternatively, the number of nodes and tasks can be specified:

    $ sbatch -N 2 --tasks-per-node 12 <Scriptname>

    In this example, like in the previous one, 24 tasks would be started, but now it is guaranteed that they are distributed across exactly 2 nodes (12 per node). You cannot start more tasks on a node than it has CPUs; SLURM will display an error message when queuing such a job.

  • When a program with multiple threads is to be run, for example when OpenMP is used, only one task is used for the program. However, one task can be allocated multiple CPUs.
    $ sbatch --nodes 1 --cpus-per-task 12 <Scriptname>

    In this example, a complete node would be used for a single OpenMP program.

  • When you need to run a program multiple times in parallel within a job, you can do that with srun. Within job scripts, srun works differently than if you were to enter it in the console: no new job is queued, but the command is run multiple times within the existing job. For an MPI program, however, mpirun is used instead of srun. As mentioned above, SLURM creates environment variables that contain job information; for example, $SLURM_NTASKS contains the number of tasks available to the current job. An MPI program can then be called in the job script with
    $ mpirun -np $SLURM_NTASKS <program name>

    and it is run with the correct number of MPI processes (a complete example script follows after this list).

    Caution: the variable SLURM_NTASKS is only set when the number of tasks is explicitly specified (e.g. via --ntasks-per-node or --ntasks), either in the job script or when queueing the job.
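
Putting these pieces together, a job script for a hypothetical MPI program (here called my_mpi_program; the module name and the resource values are only assumptions and need to be adapted) could look like this:

#!/bin/bash
#SBATCH --partition=medium
#SBATCH --time=12:00:00
#SBATCH --nodes=2
#SBATCH --tasks-per-node=12

# adapt to whatever MPI module your program was built with
module load openmpi

# SLURM_NTASKS is set because the number of tasks was specified above (2 x 12 = 24)
mpirun -np $SLURM_NTASKS ./my_mpi_program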

Updated at 17:36 on 12 August 2018 by Jan Philipp Stephan