Graphics processing units (GPUs) are specialized processors whose massively parallel architecture lets them execute certain operations faster than regular CPUs. They are therefore also called accelerators, although GPUs are only one kind of accelerator. The OMNI cluster contains 10 nodes with a total of 24 NVIDIA Tesla V100 GPUs.
If a program is to use GPUs, the parts of the program that are best suited to this need to be adapted using a GPU programming framework (e.g. CUDA, OpenACC, OpenCL, OpenMP).
To request a GPU node, you need to specify the gpu queue (partition) in your job script. Additionally, the number of required GPUs needs to be specified with the option --gres=gpu:[1|2|4].
The 24 GPUs on the OMNI cluster are distributed over the 10 GPU nodes as follows:
Here is an example job script header:
#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2
...
This allocates one node from the gpu queue that has at least two GPUs. You may vary the option as --gres=gpu:[1|2|4], depending on how many GPUs you need.
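Since only the GPU count changes between such headers, generating them can be sketched as a small shell function. This is a minimal sketch, not part of the cluster tooling: the function name gpu_job_header and its arguments are hypothetical, and the --tasks-per-node line is left out because the right value depends on your program.

```shell
#!/bin/bash
# Hypothetical helper: print a job script header for a given GPU count.
# Usage: gpu_job_header <gpus> [walltime]
gpu_job_header() {
    local gpus="$1"
    local walltime="${2:-0:30:00}"

    # Only the GPU counts named in the text above (1, 2 or 4) are accepted.
    case "$gpus" in
        1|2|4) ;;
        *)
            echo "error: GPU count must be 1, 2 or 4 (got '${gpus}')" >&2
            return 1
            ;;
    esac

    # Emit the header; the values mirror the example header above.
    cat <<EOF
#!/bin/bash
#SBATCH --time=${walltime}
#SBATCH --nodes=1
#SBATCH --partition=gpu
#SBATCH --gres=gpu:${gpus}
EOF
}

# Example: print a header that requests four GPUs on one node.
gpu_job_header 4
```

Redirecting the output into a file (for example gpu_job_header 2 > job.sh) gives a starting point to which the module commands and the actual program call still have to be appended.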
After login, the OpenHPC and OMNI software stacks are loaded by default. The OpenHPC software stack includes various tools and libraries for scientific computing on HPC systems. The OMNI software stack comprises application software installed at user request. We additionally provide a complete software stack from the Bright Cluster Manager, which provides GPU compilers and packages with GPU support tailored for machine learning applications. After loading the GPU stack, the default stacks are hidden to avoid incompatibilities.
Load the GPU software stack:
module load GpuModules
This will add the following components to your environment by default:
$ module list

Currently Loaded Modules:
  1) shared                                 6) openblas/dynamic/0.3.7            11) nccl2-cuda10.1-gcc/2.7.8
  2) slurm/omni/20.02.6                     7) protobuf3-gcc/3.8.0               12) gcc5/5.5.0
  3) python36                               8) cudnn7.6-cuda10.1/22.214.171.124  13) tensorflow2-py36-cuda10.1-gcc/2.0.0
  4) ml-pythondeps-py36-cuda10.1-gcc/3.3.0  9) hdf5_18/1.8.21
  5) keras-py36-cuda10.1-gcc/2.3.1         10) cuda10.1/toolkit/10.1.243
The command module avail gives you an overview of all modules available in GpuModules from the Bright Cluster Manager Machine Learning Stack. After unloading the GPU software stack, the default environment with the OpenHPC and OMNI software stacks is loaded again. The corresponding command is:
module unload GpuModules
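Because loading and unloading GpuModules switches between two different environments, it can help to check from within a script which one is active. The following is a minimal sketch under the assumption that the module system on the cluster maintains the colon-separated LOADEDMODULES variable, as Environment Modules and Lmod normally do. The function name module_loaded is hypothetical; the module name cuda10.1/toolkit is taken from the listing above.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a module whose name starts with the given
# prefix is currently loaded. Relies on LOADEDMODULES, a colon-separated
# list of loaded modules (assumed to be set by the module system).
module_loaded() {
    local prefix="$1"
    # Prepend a colon so that every entry, including the first, starts with ':'.
    case ":${LOADEDMODULES:-}" in
        *":${prefix}"*) return 0 ;;
        *) return 1 ;;
    esac
}

if module_loaded "cuda10.1/toolkit"; then
    echo "GPU stack is loaded"
else
    echo "GPU stack is not loaded"
fi
```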
Please remember to include the module commands in your submission script and to allocate GPUs (see above).
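Putting both parts together, a submission script could look roughly like the sketch below. It reuses the example header and the module command from this page. Two things are assumptions rather than documented behavior of the OMNI cluster: that Slurm exports the allocated GPU indices in CUDA_VISIBLE_DEVICES (its usual behavior with GPU resources), and the guard around the module call, which only keeps the sketch from failing on machines without environment modules. The program name in the last comment is a placeholder.

```shell
#!/bin/bash
#SBATCH --time=0:30:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=2
#SBATCH --partition=gpu
#SBATCH --gres=gpu:2

# Load the GPU software stack; this hides the OpenHPC and OMNI stacks.
# The guard is an addition for portability of this sketch only.
if type module >/dev/null 2>&1; then
    module load GpuModules
    module list
fi

# Slurm usually exports the indices of the allocated GPUs in this variable
# (assumption; falls back to "none" if it is not set).
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Allocated GPUs: ${gpus}"

# Start the actual workload here, for example:
# python3 my_training_script.py   # hypothetical file name
```

Such a script is submitted with sbatch followed by the script's file name.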