Graphics processing units (GPUs) are specialized processors whose massively parallel architecture lets them execute certain operations faster than regular CPUs. For this reason they are also called accelerators, although GPUs are only one kind of accelerator. The OMNI cluster contains 10 nodes with a total of 24 NVIDIA Tesla V100 GPUs.

For a program to use GPUs, the parts of the program that are best suited to GPU execution need to be adapted using a GPU programming framework (e.g. CUDA, OpenACC, OpenCL, OpenMP).

On this page we describe how you can request the use of GPUs for your jobs. We also describe which software libraries are available for developing software for GPUs and how to use these libraries.

Requesting GPU nodes on OMNI

To request a GPU node, specify the gpu queue (partition) in your job script. In addition, specify the number of GPUs you need with the option --gres=gpu:<number>.

The 24 GPUs on the OMNI cluster are distributed over the 10 GPU nodes as follows:

Node                 Number of GPUs per node
gpu-node[001-004]    4
gpu-node[005-008]    1
gpu-node[009-010]    2

Here is an example job script header:

#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
...

This allocates two GPUs on one node from the GPU queue, so the job will run on a node that has at least two GPUs. You may request --gres=gpu:1, --gres=gpu:2 or --gres=gpu:4, depending on how many GPUs you need. Additional parameters for GPU control are described in the Slurm documentation.
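
Inside a running job it can be useful to confirm how many GPUs were actually allocated. The following is a minimal sketch, assuming that Slurm on OMNI exports CUDA_VISIBLE_DEVICES for jobs that request --gres=gpu (this is common, but it depends on the cluster configuration and is not confirmed on this page). The helper name count_visible_gpus is hypothetical.

```shell
#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2

# Hypothetical helper: count the comma-separated device IDs in
# CUDA_VISIBLE_DEVICES. Assumption: Slurm sets this variable for GPU jobs.
count_visible_gpus() {
    if [ -z "${CUDA_VISIBLE_DEVICES:-}" ]; then
        echo 0
    else
        # One device ID per line, then count the non-empty lines.
        printf '%s\n' "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | grep -c .
    fi
}

echo "GPUs visible to this job: $(count_visible_gpus)"
```

If the reported number is lower than the number you requested, check the --gres line in your job script header.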

Developing for GPUs on OMNI

Most GPU-compatible modules are not immediately available on OMNI (and cannot be listed with module avail right away), because they are located in a separate software stack, the GPU modules. This is necessary for compatibility reasons. To change to the GPU stack, you need to enter the following command:

module load GpuModules

Once the GPU stack is loaded, the command module avail will give you an overview of all available modules in the GPU stack, as usual.

To switch back to the regular software stack, please enter:

module unload GpuModules

Please remember that you need to include the appropriate module loading commands in your job scripts as well.
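
As a sketch of what this can look like in a job script: the header is the one from the example job script above, load_gpu_stack is a hypothetical helper name, and the check on SLURM_JOB_ID (a variable Slurm sets inside a job) is only there so that the snippet does nothing when it is run outside the batch system.

```shell
#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2

# Hypothetical helper: switch to the GPU software stack and fail with a
# message if the module command is not available in this shell.
load_gpu_stack() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found" >&2
        return 1
    fi
    module load GpuModules
}

# Only act when running as a Slurm job (SLURM_JOB_ID is set by Slurm).
if [ -n "${SLURM_JOB_ID:-}" ]; then
    load_gpu_stack || exit 1
    module list   # record the loaded modules in the job output
    # ... load the GPU modules your program needs and start it here ...
fi
```

Stopping the job when the stack cannot be loaded avoids running a program against the wrong software stack.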

GPU Sharding

GPU sharding allows multiple Slurm jobs to use a GPU on the OMNI cluster simultaneously, which enables more efficient utilization of GPU resources. Each GPU node provides 64 shards, so up to 64 jobs can run on the GPUs of one node at the same time. Shards are requested with the Slurm parameter --gres=shard:<number of shards>. Only request as many shards as your calculation requires.

Example job script header for 2 shards:

#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=shard:2
…

Please be aware that every GPU node provides 64 shards, regardless of how many GPUs it has. On a node with one GPU, that GPU is divided into 64 shards; on a node with two GPUs, each GPU is divided into 32 shards; on a node with four GPUs, each GPU is divided into 16 shards. One shard therefore corresponds to a larger share of a GPU, and thus to more performance, on nodes with more GPUs. If you want to control the number of GPUs on the node while using shards, you can use --exclude to exclude the nodes you do not want to use.
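
The arithmetic behind this can be sketched as a small shell helper. The figure of 64 shards per node and the GPU counts per node come from this page; the function names are hypothetical and only illustrate the calculation.

```shell
#!/bin/bash
SHARDS_PER_NODE=64   # the same on every GPU node, per this page

# Number of shards that make up one physical GPU on a node
# with the given number of GPUs.
shards_per_gpu() {
    echo $(( SHARDS_PER_NODE / $1 ))
}

# Number of shards equivalent to requesting whole GPUs.
# Arguments: <number of GPUs requested> <GPUs on the node>
shards_for_gpus() {
    echo $(( $1 * SHARDS_PER_NODE / $2 ))
}

for gpus in 1 2 4; do
    echo "node with $gpus GPU(s): $(shards_per_gpu "$gpus") shards per GPU"
done
```

For example, shards_for_gpus 1 2 gives the shard count that corresponds to one GPU on a node with two GPUs.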

Example job script header for nodes with 1 GPU:

#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=shard:2
#SBATCH --exclude=gpu-node[001-004],gpu-node[009-010]
…

The --exclude values for selecting nodes by their number of GPUs:

  • 1 GPU: --exclude=gpu-node[001-004],gpu-node[009-010]
  • 2 GPUs: --exclude=gpu-node[001-008]
  • 4 GPUs: --exclude=gpu-node[005-010]
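
If you submit the same script to different node types, this mapping can be wrapped in a small helper. This is a sketch only: exclude_for_gpus is a hypothetical function, job.sh is a placeholder for your own job script, and the node names are the ones listed on this page.

```shell
#!/bin/bash
# Hypothetical helper: print the --exclude value that leaves only the
# nodes with the requested number of GPUs (1, 2 or 4).
exclude_for_gpus() {
    case "$1" in
        1) echo "gpu-node[001-004],gpu-node[009-010]" ;;
        2) echo "gpu-node[001-008]" ;;
        4) echo "gpu-node[005-010]" ;;
        *) echo "unsupported GPU count: $1 (expected 1, 2 or 4)" >&2
           return 1 ;;
    esac
}

# Example: print the sbatch command for nodes with 2 GPUs.
echo "sbatch --gres=shard:2 --exclude=\"$(exclude_for_gpus 2)\" job.sh"
```

Quote the value when you pass it on the command line so that the shell does not interpret the square brackets. Options given on the sbatch command line take precedence over the #SBATCH lines in the script.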

Additional notes:

  • If a single GPU is allocated with --gres=gpu:1 on a node with two GPUs, this is equivalent to allocating 32 shards.
  • With 64 shards and 256 GB of memory, a little less than 4 GB of memory is available per shard.

Updated at 15:19 on 8 February 2021 by Jan Steiner