Parallelization

A cluster can only be used effectively if many tasks can run on it at the same time. This is easy if there is a large number of small separated problems (trivially parallelizable). However, if a single large problem with a lot of interdependent data needs to be solved, a large effort has to be put into parallelization of the software.

A parallel program is one that is able to perform multiple tasks at the same time.

Parallelizing software requires advanced programming knowledge which this website alone cannot provide to the full extent. Only a few concepts and terms are introduced which are intended to give you some basic pointers when you plan to develop parallel software, parallelize existing software or optimize software.

ZIMT offers in-person support to members of Uni Siegen when parallelizing and optimizing their scientific software. If you would like consulting, you can send an e-mail to hpc-support@uni-siegen.de or come to our consultation hour.

Basic parallel programming terms

In the field of high performance computing (HPC), a lot of scientific software is written by the scientists themselves. If you want to write such software, a few basic terms are explained here.

Task and data parallelism

If a lot of tasks are to be run at the same time, it is called task parallelism. If the same operation is to be performed on a large amount of data, it is called data parallelism. In scientific computing the latter case is especially common, for example when the movement of large numbers of particles is to be calculated.

Processes and threads

As described on the Linux basics page, a lot of processes run in parallel on a Linux system. However a single process can also launch multiple threads which may work on separate tasks at the same time.

Shared and distributed memory

The term shared memory parallelization describes a situation where tasks running in parallel use the same section of memory (RAM), typically because they are part of one process (like a process with multiple threads). Since each node has its own RAM on the OMNI cluster, shared memory implies that a process can launch up to 64 threads on a normal compute node and is limited to that one node and its RAM.

If that is not sufficient, a distributed memory parallelization is necessary. Here, a separate process is launched for each parallel task and each process has its own section of RAM. The advantage is that the processes do not necessarily have to run on the same node. In addition to the challenges of parallelization, the developer needs to decide which processes need to communicate what data at which time.

The two variants can also be combined, this is termed hybrid parallelization.

Programming languages and software

Most programming languages support parallelism in some way, but not all are equally convenient for parallel programming. Also, many languages introduce a large overhead, meaning that in addition to the operation the user wants to do (e.g. add two arrays), additional operations are performed (e.g. creation of objects) or additional information needs to be stored in RAM (e.g. object metadata). Usually, comparatively low-level programming languages like C or Fortran are used when performance is critical, rather than languages with a high level of abstraction like Python or MATLAB (although the latter two are common in HPC when ease of use is more important). As a rule-of-thumb, a Fortran or C program can potentially be two orders of magnitude faster than a Python program implementing the same algorithm (“potentially” because the program needs to be written in a way that the advantage is actually exploited). For this reason, Fortran in particular is still popular in HPC, while being considered obsolete in most other fields.

There are two especially important software tools for parallel programming, MPI and OpenMP. Both are available in the form of libraries and are the most common parallelization methods. They are particularly suited to C/C++ and Fortran programs, but interfaces to many other languages exist as well (e.g. mpi4py for Python).

MPI

MPI stands for Message Passing Interface and is a standard for distributed memory parallelization. Several implementations of this standard exist, the most common ones are Open MPI (not to be confused with OpenMP) and Intel MPI, both of which are available on the cluster. MPI contains a number of functions which can be used to communicate data to other processes. For example, an MPI_Send function can send data from a specific process to another one, while MPI_Scatter can distribute data to multiple processes. The difficulty for the developer lies in deciding which information needs to be communicated when. A general introduction to MPI in PDF form can be found here.

OpenMP

OpenMP (MP stands for multi-processing) is an application programming interface (API) for programs that use multiple threads (shared memory parallelization). OpenMP also exists for both C and Fortran, in both cases parallel tasks in the code are marked with preprocessor directives. For example, in a C program #pragma omp parallel for may be inserted above a for-loop. When code execution reaches this line, a number of threads will be created, data and tasks will be distributed among them and at the end, the data will be joined again (fork-join model). OpenMP is generally a little easier to understand and to implement than MPI. An OpenMP introduction can be found here.

Optimization and efficiency

When optimizing software, one should follow the rule “no optimization without performance measurement”. If your software runs too long or needs too much memory, first you should determine where in the code that happens. There are several tools for performance measurement, also called “profilers”, for example gprof, TAU or Scalasca. The first is a common and widely used general purpose tool, while the other two are specifically for HPC purposes. All three are available on the cluster.

All of these considerations concern the so-called serial efficiency, i.e. the efficiency of an individual process. In contrast, the parallel efficiency is a measure of how well your software is parallelized. For example, if one of your processes has more work to do and the others are idle, that is a problem of parallel efficiency. When optimizing software, it is recommended that you look at serial efficiency first.

Here are some more issues one might need to think about in relation to serial efficiency:

Overhead: As described above, a task may require additional operations that might not be immediately apparent. Profiling your code will tell you how much time is spent on what.
Scaling: In addition to the number of operations that need to be performed for each data point in your computation, you also need to worry about how the number of operations scales with your data points. For example, if you need to create a matrix for your N data points, then N * N operations need to be performed to fill your matrix. In other words, no matter how simple these N * N operations are, the simple fact that you need to perform 4 times as many of them as you double N affects your performance. The larger your problem size gets, the more you need to think about scaling as opposed to the performance of the individual operations. This is also termed algorithmic complexity. For example, an algorithm which needs to perform N squared operations is said to have second-order complexity.
Cache size: Modern CPUs will load data from the RAM into small units memory on the CPU called caches. This takes time, and so does writing the data back to RAM after the operations are done. The more operations can be performed solely with data in the cache without having to access the RAM to exchange the cache’s contents, the better.

Here are some issues relating to parallel efficiency:

Load balancing: The longer some of your processes/threads are idle, the less efficient your code.
Deadlocks: It is possible, especially with MPI communication, to create a situation where multiple processes are waiting for data from one another and your code will be stuck for an infinite time.
Race conditions: The term race condition describes a situation where a piece of data is read from and written to by multiple processes without any control when that happens. In such a case, the order in which the processes read and write, which is determined by chance, changes the result and may introduce errors that are hard to detect.

This list is by no means exhaustive.

Aktualisiert um 15:51 am 8. February 2021 von Gerd Pokorra

Cluster-News