Load the hpcx module:
$ module load gcc-7.3; module load hpcx
$ module list
You should see at least the two modules, gcc-7.3 and hpcx, listed.
Retrieve the files for the MPI portion by copying them on ManeFrame II at the command line:
$ cp /hpc/examples/workshops/hpc/session15_MPI.tgz .
Then, retrieve the files for the hybrid MPI+OpenMP portion by copying them on ManeFrame II at the command line:
$ cp /hpc/examples/workshops/hpc/session15_Hybrid.tgz .
Unpack the source files for the MPI portion of this tutorial as usual,
$ tar -zxf session15_MPI.tgz
Unlike OpenMP, MPI is implemented as a standalone library. This means that MPI merely consists of functions that you may call within your own programs to perform message passing within a distributed memory parallel computation.
MPI libraries are typically written in C (like the Linux kernel, for maximum portability), but include interfaces for programs written in C, C++, Fortran 77, Fortran 90, and even Python.
Moreover, since MPI is a library, it does not require any specific compiler extensions to construct an MPI-enabled parallel program; as long as you have any “standard” compiler for one of these languages, you can have a functioning MPI installation.
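As a small illustration (this hello.cpp file is our own sketch, not part of the workshop files), a complete MPI program needs nothing more than a few calls into the library:

// hello.cpp -- a minimal MPI program (illustrative sketch)
#include <cstdio>
#include <mpi.h>

int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);                 // start the MPI runtime
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's ID
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of processes
  printf("Hello from process %d of %d\n", rank, size);
  MPI_Finalize();                         // shut down the MPI runtime
  return 0;
}

Every call here comes from the MPI library itself; the compiler sees nothing but ordinary C++.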
In order to compile a program to use any given software library, a few key items must be known about how the library was installed on the system:
- the location of the library’s header files, supplied to the compiler with the -I argument.
- the locations and names of the library files themselves, supplied with the -L and -l arguments.
For example, the MVAPICH2 MPI library built using the GNU version 4.9.1 compiler is installed on ManeFrame II in the directory /hpc/mvapich2/2.0/gcc-4.9.1/, with header files located in /hpc/mvapich2/2.0/gcc-4.9.1/include/ and library files located in /hpc/mvapich2/2.0/gcc-4.9.1/lib/. Without me telling you that, how easy do you think it would be to find these on your own?
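One quick approach, if you know the installation prefix, is simply to list its directories:
$ ls /hpc/mvapich2/2.0/gcc-4.9.1/lib
This reveals the library files that the installation provides, though you would still need to know which of them to link against.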
Finally, because I’m familiar with this package, I know that to compile an executable I must link against the library files libmpich.a and libmpl.a in that library directory.
As a result, we could compile the executable driver.exe with the command
$ g++ driver.cpp -I/hpc/mvapich2/2.0/gcc-4.9.1/include \
-L/hpc/mvapich2/2.0/gcc-4.9.1/lib -lmpich -lmpl -lm -o driver.exe
Clearly, specifying the correct include and link instructions for an MPI library can be nontrivial.
Thankfully, MPI library writers typically include wrapper scripts to do most of this work for you. Such scripts are written to encode all of the above information that is required to use MPI with a given compiler on a specific system.
Depending on your programming language and the specific MPI implementation, these wrapper scripts can have different names. The typical names for these MPI wrapper scripts for all MPI libraries installed on ManeFrame II are:
- C++: mpicxx or mpic++
- C: mpicc
- Fortran 90/95: mpif90
- Fortran 77: mpif77 (typically, the Fortran 90/95 wrapper will also work for these programs)
In order to use these wrapper scripts on ManeFrame II, we must first load the correct module environment. Many are available:
$ module avail
Do you see how many of these available modules include the names mpich2, mvapich2, and openmpi? Each of these modules will enable the wrapper scripts for a different MPI library and compiler.
As we mentioned at the beginning of this tutorial, today we’ll focus on using the GNU compiler with the InfiniBand-optimized MVAPICH2 MPI library. Ensure that you still have the hpcx module loaded:
$ module load gcc-7.3; module load hpcx
$ module list
This installation provides the MPI wrapper scripts mpicc, mpicxx, mpif90, and mpif77.
We may now use the corresponding C++ MPI wrapper script, along with the (much simpler) compilation line
$ mpicxx driver.cpp -lm -o driver.exe
to build the executable.
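If you are curious what the wrapper actually does, most MPI implementations can print the underlying compiler command instead of running it. MPICH-derived libraries (including MVAPICH2) accept the -show option, while Open MPI-based wrappers use --showme:
$ mpicxx -show
The output is the full compiler invocation, including all of the -I, -L, and -l arguments that the wrapper supplies on your behalf.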
When running jobs on a dedicated parallel cluster (or a single workstation), parallel jobs and processes are not regulated through a queueing system. This has some immediate benefits, but dedicated clusters also have some serious deficiencies. In any case, running parallel programs on such a system can be very simple, though the way that you run these jobs will depend on which MPI implementation you are using.
Since ManeFrame II is a large-scale, shared computing resource, we use the SLURM queueing system to manage user jobs. However, in some instances it may be useful to run MPI jobs interactively on ManeFrame II. This can be especially useful when debugging or testing a new code to ensure that it functions correctly, before submitting larger-scale or longer-running jobs to the queueing system.
On ManeFrame II, we should only run interactive MPI jobs by requesting them through the batch system. This may be accomplished with the srun command. Recall the two srun options, -N and -n, that request a specified number of nodes and tasks. To request an interactive session on an entire workshop node (with up to 8 tasks on that node), for a maximum of 10 minutes, issue the command:
$ srun -I -N 1 -n 8 -t 10 -p workshop --x11=first --pty $SHELL
If this command completes successfully, you should note that you are in a new shell with a different hostname:
$ hostname
(This should no longer be mflogin01.hpc.smu.edu or mflogin02.hpc.smu.edu, but something like mfc0321.hpc.smu.edu.) Here, we can launch an MPI program interactively using the program mpiexec, as if we were launching it on our own workstation. The calling syntax of mpiexec is
mpiexec [mpiexec_options] program_name [program_options]
The primary mpiexec option that we use is -n #, where # is the desired number of MPI processes to use in running the parallel job.
First, run the program using 1 process:
$ mpiexec -n 1 ./driver.exe
Run the program using 2 processes:
$ mpiexec -n 2 ./driver.exe
Run the program using 4 processes:
$ mpiexec -n 4 ./driver.exe
All of these will run the MPI processes on separate cores of our currently-reserved worker node.
Since the ManeFrame II nodes have 8 physical CPU cores, we are limited to requesting at most 8 tasks in the srun command, and to launching at most 8 tasks in the subsequent mpiexec command.
Running MPI batch jobs on ManeFrame II is almost identical to running serial and OpenMP batch jobs. However, when running MPI jobs, we must tell the queueing system a few additional pieces of information about the parallel resources the job needs.
We have two key ways to control execution of parallel batch jobs: the job reservation and the job execution.
The job reservation corresponds with the options specified with the #SBATCH prefix in the job submission file. These tell SLURM about the resources you wish to reserve. Here, the most relevant options are:
#SBATCH -N <NumNodes>
#SBATCH -n <NumTasks>
#SBATCH --ntasks-per-node=<NumLoad>
#SBATCH --exclusive
These options signify:
- -N <NumNodes> – This requests that <NumNodes> nodes be reserved for this job. The request should not exceed the total number of nodes configured in the partition, otherwise the job will be rejected.
- -n <NumTasks> or --ntasks=<NumTasks> – This requests that the job will launch a maximum of <NumTasks> tasks.
- --ntasks-per-node=<NumLoad> – This requests that a maximum of <NumLoad> cores be used on each node.
- --exclusive – This requests that the allocated nodes not be shared with other users.
Clearly, if you specify a value of NumTasks that is more than 8 times your value of NumNodes, the job will cause an error, since you will be requesting more tasks than you have physical cores.
The job execution corresponds with the command that you actually use to launch the MPI job. Here, we may launch the job inside of our submission script with the usual
mpiexec <executable>
call; however, it is preferable to replace mpiexec here with srun:
srun <executable>
or even
srun -n <NumProcesses> <executable>
The first two of these are essentially equivalent (unless one of the MPI tasks fails for some reason, in which case srun will exit gracefully whereas mpiexec may not).
The third option is somewhat different, as it launches the MPI job with precisely <NumProcesses> MPI tasks. This value must not exceed the total reservation size, but it may be smaller.
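For instance, here is a minimal sketch (modeled on the example scripts below) that reserves 8 tasks but launches only 4 of them:
#!/bin/bash
#SBATCH -n 8                 # requested MPI tasks
#SBATCH -p workshop          # requested queue
#SBATCH -t 1                 # maximum runtime in minutes
srun -n 4 ./driver.exe       # launch on only 4 of the 8 reserved tasks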
Perhaps the easiest way to understand these options is through a series of examples.
Example 1: specifying the number of MPI tasks
The simplest way to launch an MPI job with SLURM is to just request a
specific number of MPI tasks with the -n
option:
#!/bin/bash
#SBATCH -n 12 # requested MPI tasks
#SBATCH -p workshop # requested queue
#SBATCH -t 1 # maximum runtime in minutes
srun ./driver.exe
When running this, my job ran with 12 total MPI tasks on two nodes, with 8 tasks on the first node and 4 on the second. Since this does not specify the --exclusive option, some of these MPI tasks may be launched on nodes shared with others; it’s even possible that one MPI task will be launched on each of 12 different nodes that are shared by others.
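If you want to verify the task placement in any of these examples yourself, a simple trick is to add the line
srun hostname
to the submission script: each task prints the name of the node it runs on, so counting the repeated names reveals the per-node distribution.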
Example 2: specifying the number of MPI tasks (exclusive)
If we add in only the --exclusive option, this changes the behavior slightly:
#!/bin/bash
#SBATCH -n 12 # requested MPI tasks
#SBATCH -p workshop # requested queue
#SBATCH -t 1 # maximum runtime in minutes
#SBATCH --exclusive # do not share nodes
srun ./driver.exe
This job will always run with 12 total MPI tasks on two nodes, distributed evenly with 6 tasks on each node; the other 2 cores per node are left unused.
Example 3: filling a specified portion of each node
For some jobs that require significant amounts of memory, 8 MPI tasks per node may need more memory than a node provides. In this case we may wish to reserve nodes based on memory capacity, and only launch a few MPI tasks per node. This is where the --ntasks-per-node option comes in handy:
#!/bin/bash
#SBATCH -N 2 # requested nodes
#SBATCH --ntasks-per-node=2 # task load per node
#SBATCH -p workshop # requested queue
#SBATCH -t 1 # maximum runtime in minutes
#SBATCH --exclusive # do not share nodes
srun ./driver.exe
This job will run with 4 total MPI tasks, but now with 2 tasks on each of the 2 nodes.
Example 4: filling a specified portion of each node (revisited)
An alternate way to perform the same kind of run would be to specify the total number of MPI tasks, along with the load you want on each node:
#!/bin/bash
#SBATCH -n 15 # requested MPI tasks
#SBATCH --ntasks-per-node=4 # maximum task load per node
#SBATCH -p workshop # requested queue
#SBATCH -t 1 # maximum runtime in minutes
#SBATCH --exclusive # do not share nodes
srun ./driver.exe
As expected, this run uses 15 total MPI tasks, allocated so that each node has at most 4 tasks: three nodes run 4 MPI tasks each, and a fourth node runs only 3.
However, it is recommended that you also specify -N for such jobs, so that the queueing system knows at the time of submission how many total nodes will eventually be needed, e.g.
#!/bin/bash
#SBATCH -N 4 # requested nodes
#SBATCH -n 15 # requested MPI tasks
#SBATCH --ntasks-per-node=4 # maximum task load per node
#SBATCH -p workshop # requested queue
#SBATCH -t 1 # maximum runtime in minutes
#SBATCH --exclusive # do not share nodes
srun ./driver.exe
Interestingly, when running the former of these two approaches, squeue reported before launch that the job would require 2 nodes instead of the full 4. While the code still ran, I may have just gotten lucky.
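You can check this yourself while a job is waiting in the queue:
$ squeue -u $USER
lists your jobs, and the NODES column shows how many nodes SLURM currently plans to allocate.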
Compile the executable driver.exe using the GNU compilers.
Set up submission scripts to run this executable using 1, 2, 4, 8, 16, 32 and 64 cores. For the 1, 2, 4, and 8 processor jobs, just use one node. Run the 16, 32 and 64 processor jobs using 8 cores per node.
Determine the parallel speedup when running this code using MPI. Does it speed up optimally (i.e. by a factor of 64)?
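Recall that the parallel speedup on p processes is S(p) = T(1)/T(p), where T(p) is the wall-clock runtime using p processes; optimal (linear) speedup means S(p) = p, i.e. the 64-process run would finish 64 times faster than the serial run.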