Session 15: Distributed Memory Parallel Programming with MPI

Distributed-memory programs

Getting started

Load the hpcx module:

$ module load gcc-7.3; module load hpcx
$ module list

You should see at least the two modules, gcc-7.3 and hpcx, listed.

Retrieve the files for the MPI portion by copying them on ManeFrame II at the command line:

$ cp /hpc/examples/workshops/hpc/session15_MPI.tgz .

Then, retrieve the files for the hybrid MPI+OpenMP portion by copying them on ManeFrame II at the command line:

$ cp /hpc/examples/workshops/hpc/session15_Hybrid.tgz .

MPI overview

Unpack the source files for the MPI portion of this tutorial as usual,

$ tar -zxf session15_MPI.tgz

Unlike OpenMP, MPI is implemented as a standalone library. This means that MPI consists simply of functions that you may call within your own programs to perform message passing within a distributed-memory parallel computation.

MPI libraries are typically written in C (like the Linux kernel, for maximum portability), and they include interfaces for programs written in C, C++, Fortran 77, Fortran 90, and even Python.

Moreover, since MPI is a library, it does not require any specific compiler extensions to construct an MPI-enabled parallel program, so as long as you have any “standard” compiler for these languages, you can have a functioning MPI installation.
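To make the library nature of MPI concrete, here is a minimal sketch of an MPI program (a hypothetical hello_mpi.cpp, not this session's driver.cpp): it initializes the library, queries each process's rank and the total process count, and passes a single message between two ranks.

// hello_mpi.cpp: minimal MPI sketch (hypothetical; not this session's driver.cpp)
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);               // start the MPI runtime

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank); // this process's ID, 0 .. size-1
  MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of MPI processes

  printf("Hello from rank %d of %d\n", rank, size);

  if (size > 1) {                       // pass one message from rank 1 to rank 0
    int payload = 42;
    if (rank == 1) {
      MPI_Send(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {
      int received;
      MPI_Recv(&received, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("Rank 0 received %d from rank 1\n", received);
    }
  }

  MPI_Finalize();                       // shut down the MPI runtime
  return 0;
}

Every MPI call here (MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv, MPI_Finalize) is an ordinary library function; the sections below show how to compile and launch such a program.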

Compiling MPI code

MPI wrapper scripts

In order to compile a program to use any given software library, a few key items must be known about how the library was installed on the system:

  • Does the library provide header files (C, C++) or modules (F90), and where are these located? This location is important because when compiling our own codes, we must tell the compiler where to look for these “include files” using the -I argument.
  • If the library was installed in a non-default location, where is the resulting “.a” file (static library) or “.so” file (shared library) located? Again, this location is important because when linking our own codes, we must tell the compiler where to look for these library files using the -L and -l arguments.

For example, the MVAPICH2 MPI library built using the GNU version 4.9.1 compiler is installed on ManeFrame II in the directory /hpc/mvapich2/2.0/gcc-4.9.1/, with header files located in /hpc/mvapich2/2.0/gcc-4.9.1/include/ and library files located in /hpc/mvapich2/2.0/gcc-4.9.1/lib/. Without me telling you that, how easy do you think it would be to find these on your own?

Finally, because I’m familiar with this package, I know that to compile an executable I must link against the library files libmpich.a and libmpl.a in that library directory.

As a result, we could compile the executable driver.exe with the commands

$ g++ driver.cpp -I/hpc/mvapich2/2.0/gcc-4.9.1/include \
  -L/hpc/mvapich2/2.0/gcc-4.9.1/lib -lmpich -lmpl -lm -o driver.exe

Clearly, specifying the specific instructions for including and linking to an MPI library can be nontrivial:

  • You must know where all of the relevant libraries are installed on each computer.
  • You must know which specific library files are required for compiling a given program.
  • Sometimes, you must even know the order in which these library files must be specified on the linking line.

Thankfully, MPI library writers typically include wrapper scripts to do most of this work for you. Such scripts are written to encode all of the above information that is required to use MPI with a given compiler on a specific system.

Depending on your programming language and the specific MPI implementation, these wrapper scripts can have different names. The typical names for these MPI wrapper scripts for all MPI libraries installed on ManeFrame II are:

  • C++: mpicxx or mpic++
  • C: mpicc
  • Fortran 90/95/2003: mpif90
  • Fortran 77: mpif77 (typically, the Fortran 90/95 wrapper will also work for these)

In order to use these wrapper scripts on ManeFrame II, we must first load the correct module environment. Many are available:

$ module avail

Do you see how many of these available modules include the names mpich2, mvapich2 and openmpi? Each of these modules will enable the wrapper scripts for a different MPI library and compiler.

As we mentioned at the beginning of this tutorial, today we’ll focus on using the GNU compiler with the InfiniBand-optimized MVAPICH2 MPI library. Ensure that you still have the module hpcx loaded:

$ module load gcc-7.3; module load hpcx
$ module list

This installation provides the MPI wrapper scripts mpicc, mpicxx, mpif90 and mpif77.
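If you are curious what a wrapper script actually does behind the scenes, most implementations can print the full underlying compile/link line without executing it. As an illustrative (implementation-dependent) check: MPICH-derived wrappers typically accept -show, while Open MPI-based wrappers typically accept --showme:

$ mpicxx -show driver.cpp -o driver.exe     # MPICH-style wrappers (e.g. MVAPICH2)
$ mpicxx --showme driver.cpp -o driver.exe  # Open MPI-style wrappers

Whichever flag your loaded module supports should echo a long g++ command much like the manual one above.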

We may now use the corresponding C++ MPI wrapper script, along with the (much simpler) compilation line

$ mpicxx driver.cpp -lm -o driver.exe

to build the executable.

Running MPI code

Running MPI interactive jobs

On a dedicated parallel cluster (or a single workstation), jobs and processes are typically not regulated through a queueing system. This has some immediate benefits:

  • You never have to wait to start running a program.
  • It is trivial to set up and run parallel jobs.
  • You have complete control over which processors are used in a parallel computation.

However, dedicated clusters also have some serious deficiencies:

  • A single user can monopolize all of the available resources.
  • More than one job can be running on a processor at a time, so different processes must fight for system resources (giving unreliable timings or memory availability).
  • The more users there are, the worse these problems become.

Nevertheless, running parallel programs on such a system can be very simple, though the way that you run these jobs will depend on which MPI implementation you are using.

Since ManeFrame II is a large-scale, shared computing resource, we use the SLURM queueing system to manage user jobs. However, in some instances it may be useful to run MPI jobs interactively on ManeFrame II. This can be especially useful when debugging or testing a new code to ensure that it functions correctly, before submitting larger-scale or longer-running jobs to the queueing system.

On ManeFrame II, we should only run interactive MPI jobs by requesting them through the batch system. This may be accomplished with the srun command. Recall the two srun options, -N and -n, that request a specified number of nodes and tasks. To request an interactive session on an entire workshop node (with up to 8 tasks on that node), for a maximum of 10 minutes, issue the command:

$ srun -I -N 1 -n 8 -t 10 -p workshop --x11=first --pty $SHELL

If this command completes successfully, you should note that you are in a new shell with a different hostname:

$ hostname

(this should no longer be mflogin01.hpc.smu.edu or mflogin02.hpc.smu.edu, but rather something like mfc0321.hpc.smu.edu). Here, we can launch an MPI program interactively using the program mpiexec, as if we were launching it on our own workstation. The calling syntax of mpiexec is

mpiexec [mpiexec_options] program_name [program_options]

The primary mpiexec option that we use is -n #, where # is the desired number of MPI processes to use in running the parallel job.

First, run the program using 1 process:

$ mpiexec -n 1 ./driver.exe

Run the program using 2 processes:

$ mpiexec -n 2 ./driver.exe

Run the program using 4 processes:

$ mpiexec -n 4 ./driver.exe

All of these will run the MPI processes on separate cores of our currently-reserved worker node.

Since the ManeFrame II nodes have 8 physical CPU cores, we are limited to requesting at most 8 tasks in the srun command, and to launching at most 8 tasks in the subsequent mpiexec command.

Running MPI batch jobs

Running MPI batch jobs on ManeFrame II is almost identical to running serial and OpenMP batch jobs. However, when running MPI jobs, we must tell the queueing system a few additional pieces of information:

  1. How many total nodes do we want to reserve on the machine?
  2. How many total cores do we want to reserve on the machine?
  3. How do we want to distribute the tasks on each node?
  4. How many MPI tasks do we actually want to run?

We have two key ways to control execution of parallel batch jobs:

  • controlling how the job is reserved
  • controlling how the MPI job is executed

MPI batch job reservations

The job reservation corresponds to the options specified with the #SBATCH prefix in the job submission file. These tell SLURM about the resources you wish to reserve. Here, the most relevant options are:

#SBATCH -N <NumNodes>
#SBATCH -n <NumTasks>
#SBATCH --ntasks-per-node=<NumLoad>
#SBATCH --exclusive

These options signify:

  • -N <NumNodes> – This requests that <NumNodes> nodes be reserved for this job. The request should not exceed the total number of nodes configured in the partition; otherwise, the job will be rejected.
  • -n <NumTasks> or --ntasks=<NumTasks> – This requests that the job will launch a maximum of <NumTasks> tasks.
  • --ntasks-per-node=<NumLoad> – This requests that a maximum of <NumLoad> cores should be used on each node.
  • --exclusive – This requests that the allocated nodes not be shared with other users.

Clearly, if you specify a value of NumTasks that is more than 8 times your value of NumNodes, the job will fail, since you will be requesting more tasks than you have reserved physical cores.

MPI batch job execution

The job execution corresponds with the command that you actually use to launch the MPI job. Here, we may launch the job inside of our submission script with the usual

mpiexec <executable>

call; however, it is preferable to replace mpiexec here with srun:

srun <executable>

or even

srun -n <NumProcesses> <executable>

The first two of these are essentially equivalent (unless one of the MPI tasks fails for some reason, in which case srun will exit gracefully whereas mpiexec may not).

The third option is somewhat different, as it launches the MPI job with precisely <NumProcesses> MPI tasks. This value must not exceed the total reservation size, but it may be smaller.

Perhaps the easiest way to understand these options is through a series of examples.

MPI batch file examples

Example 1: specifying the number of MPI tasks

The simplest way to launch an MPI job with SLURM is to just request a specific number of MPI tasks with the -n option:

#!/bin/bash
#SBATCH -n 12                # requested MPI tasks
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes

srun ./driver.exe

When running this, my job ran with 12 total MPI tasks on two nodes, with 8 tasks on the first node and 4 on the second. Since this does not specify the --exclusive option, some of these MPI tasks may be launched on nodes shared with others; it’s even possible that one MPI task will be launched on each of 12 different nodes that are shared by others.

Example 2: specifying the number of MPI tasks (exclusive)

If we add in only the --exclusive option, this changes the behavior slightly:

#!/bin/bash
#SBATCH -n 12                # requested MPI tasks
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes
#SBATCH --exclusive          # do not share nodes

srun ./driver.exe

This job will now always run with 12 total MPI tasks on two nodes, distributed evenly with 6 tasks on each node, leaving the other 2 cores on each node unused.

Example 3: filling a specified portion of each node

For some jobs that require significant amounts of memory, 8 MPI tasks per node may require more memory than a node provides. In this case we may wish to reserve nodes based on memory capacity, and launch only a few MPI tasks per node. This is where the --ntasks-per-node option comes in handy:

#!/bin/bash
#SBATCH -N 2                 # requested nodes
#SBATCH --ntasks-per-node=2  # task load per node
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes
#SBATCH --exclusive          # do not share nodes

srun ./driver.exe

This job will run with 4 total MPI tasks, but now with 2 tasks on each of the 2 nodes.

Example 4: filling a specified portion of each node (revisited)

An alternate way to perform the same kind of run would be to specify the total number of MPI tasks, along with the load you want on each node:

#!/bin/bash
#SBATCH -n 15                # requested MPI tasks
#SBATCH --ntasks-per-node=4  # maximum task load per node
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes
#SBATCH --exclusive          # do not share nodes

srun ./driver.exe

As expected, this run uses 15 total MPI tasks, allocated so that each node has at most 4 tasks; three nodes run 4 MPI tasks, and a fourth node runs only 3.

However, it is recommended that you also specify -N for such jobs so that the queueing system knows at time of submission how many total nodes will eventually be needed, e.g.

#!/bin/bash
#SBATCH -N 4                 # requested nodes
#SBATCH -n 15                # requested MPI tasks
#SBATCH --ntasks-per-node=4  # maximum task load per node
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes
#SBATCH --exclusive          # do not share nodes

srun ./driver.exe

Interestingly, when running the former of these two approaches, before the job was launched, squeue reported that the job would require 2 nodes instead of the full 4. While the code still ran, I may have just gotten lucky.

MPI exercise

Compile the executable driver.exe using the GNU compilers.

Set up submission scripts to run this executable using 1, 2, 4, 8, 16, 32 and 64 cores. For the 1, 2, 4, and 8 processor jobs, just use one node. Run the 16, 32 and 64 processor jobs using 8 cores per node.
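As a starting point, here is a sketch of the 16-core script, following the conventions from the examples above (the other scripts change only the -N and -n values, and possibly the time limit):

#!/bin/bash
#SBATCH -N 2                 # requested nodes
#SBATCH -n 16                # requested MPI tasks
#SBATCH --ntasks-per-node=8  # fill all 8 cores on each node
#SBATCH --exclusive          # do not share nodes
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes

srun ./driver.exe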

Determine the parallel speedup when running this code using MPI. Does it speed up optimally (i.e. by a factor of 64)?
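Recall that the parallel speedup on p cores is the ratio of serial to parallel runtime, S(p) = T(1)/T(p), and the parallel efficiency is E(p) = S(p)/p. For example (with made-up numbers), if the 1-core run takes 640 seconds and the 64-core run takes 20 seconds, then S(64) = 640/20 = 32, for an efficiency of 32/64 = 50% of optimal.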

Hybrid shared/distributed-memory programs

There is no reason why we cannot mix the two parallelism approaches above, using MPI to communicate between nodes while using OpenMP to share computation among the CPU cores within each node.

Unpack the source files for this portion of the tutorial as usual,

$ tar -zxf session15_Hybrid.tgz

Compiling Hybrid MPI+OpenMP code

We compile hybrid MPI+OpenMP programs by combining the two previous compilation strategies: MPI wrapper scripts plus OpenMP compiler flags.

Ensure that you still have the module hpcx loaded,

$ module load gcc-7.3; module load hpcx
$ module list

and compile the program with the command

$ mpicxx -fopenmp driver.cpp -o driver.exe
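If you would like to see the hybrid structure in miniature, here is a sketch of a hypothetical hello_hybrid.cpp (not this session's driver.cpp) in which each MPI task launches its own OpenMP thread team; it builds with the same mpicxx -fopenmp command:

// hello_hybrid.cpp: minimal hybrid MPI+OpenMP sketch (hypothetical; not driver.cpp)
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char* argv[]) {
  // request thread support, since OpenMP threads will run inside each MPI task
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // each MPI task spawns a thread team; the team size is set at run time
  // by the OMP_NUM_THREADS environment variable
  #pragma omp parallel
  {
    printf("MPI task %d of %d: OpenMP thread %d of %d\n",
           rank, size, omp_get_thread_num(), omp_get_num_threads());
  }

  MPI_Finalize();
  return 0;
}

With the 4-node submission script shown below (one MPI task per node and OMP_NUM_THREADS=8), this sketch would print one line for each of the 32 (task, thread) pairs.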

Running Hybrid MPI+OpenMP code

To run a hybrid MPI+OpenMP job, we similarly combine the two previous approaches: job submission that specifies the number of MPI tasks plus environment variables to specify the number of OpenMP threads per MPI task.

In setting up these jobs, we want to ensure two things:

  1. We clearly specify the number of MPI tasks per node and OpenMP threads per node so that we do not overcommit the available resources.
  2. We balance the MPI tasks so that they are evenly distributed among the reserved nodes (and not all lumped onto the first few nodes).

We can accomplish both of these goals through techniques that we already learned for evenly distributing MPI tasks on nodes to balance memory constraints.

For example, to run our hybrid MPI+OpenMP code using 4 nodes, with each node running a single MPI task but launching 8 OpenMP threads, we can use the following submission script:

#!/bin/bash
#SBATCH -N 4                 # requested nodes
#SBATCH --ntasks-per-node=1  # one MPI task per node
#SBATCH --exclusive          # do not share nodes
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes

export OMP_NUM_THREADS=8
srun ./driver.exe

Hybrid MPI+OpenMP exercise

Compile the executable driver.exe to enable hybrid MPI+OpenMP parallelism using the GNU compilers.

Set up submission scripts to run this executable in the following ways:

  • 2 MPI tasks, 8 OpenMP threads each
  • 4 MPI tasks, 4 OpenMP threads each
  • 8 MPI tasks, 2 OpenMP threads each
  • 16 MPI tasks, 1 OpenMP thread each

Set up each of these to run on exactly 2 nodes.

All four of these experiments use the same number of CPU cores. Do some approaches outperform others?