As seen previously, we may use ManeFrame II for parallel computing (in addition to running serial jobs). In the remainder of this tutorial, we will differentiate between running shared-memory parallel (SMP) programs, typically enabled by OpenMP, and distributed-memory parallel (DMP) programs, typically enabled by MPI. Hybrid MPI+OpenMP programming is also possible on ManeFrame II, and we will end this tutorial session with a brief study of these.
We have also seen that we have the choice among the GNU, PGI, and Intel compilers when compiling our codes on ManeFrame II. In the tutorial below, we will focus on use of the GNU compilers; use of the PGI and Intel compilers is similar.
Since we'll be using the GNU compiler throughout this tutorial, load the gcc-7.3 module:
$ module load gcc-7.3
$ module list
Second, you will need to retrieve sets of files for the OpenMP, MPI, and hybrid MPI+OpenMP portions of this session. Retrieve the files for the OpenMP portion by clicking this link or by copying them on ManeFrame II at the command line:
$ cp /hpc/examples/workshops/hpc/session14.tgz .
We may run shared-memory programs on any ManeFrame II worker node. All ManeFrame II worker nodes have 8 CPU cores. In my experience, shared-memory programs rarely benefit from using more execution threads than the number of physical cores on a node, so I recommend that SMP jobs use at most 8 threads, though your application may act differently.
OpenMP is implemented as an extension to existing programming languages, and is available for programs written in C, C++, Fortran 77, and Fortran 90. These OpenMP extensions are enabled at the compiler level, and most modern compilers support them: you supply a flag telling the compiler that you wish it to allow the OpenMP extensions to the existing language. The flags for well-known compilers include:
PGI: -mp
GNU: -fopenmp
Intel: -openmp (newer versions also accept -qopenmp)
IBM XL: -qsmp
Oracle/Sun: -xopenmp
Before proceeding to the following subsections, unpack the OpenMP portion of this tutorial using the usual commands:
$ tar -zxf session14.tgz
$ cd session14
In the resulting directory, you will find a number of files, including Makefile, driver.cpp, and vectors.cpp.
You can compile the executable driver.exe with the GNU compiler and OpenMP using the command
$ g++ -fopenmp driver.cpp vectors.cpp -lm -o driver.exe
The compiler option -fopenmp is the same, no matter which GNU compiler you are using (gcc, gfortran, etc.).
Note
The only difference when using the PGI compilers is the compiler name and OpenMP flag, e.g.
$ pgc++ -mp driver.cpp vectors.cpp -lm -o driver.exe
Run the executable driver.exe from the command line:
$ ./driver.exe
Depending on your default setup, this run will have used either 1 or 8 threads.
To control the number of threads used by our program, we must adjust the OMP_NUM_THREADS environment variable. First, check your current default value (it may be blank):
$ echo $OMP_NUM_THREADS
The method for re-setting this environment variable will depend on your login shell. First, determine which login shell you use:
$ echo $SHELL
For CSH/TCSH users, you can set your OMP_NUM_THREADS environment variable to 2 with the command:
$ setenv OMP_NUM_THREADS 2
The same may be accomplished by BASH/SH/KSH users with the command:
$ export OMP_NUM_THREADS=2
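In either shell, you can confirm the new value took effect by echoing the variable back before launching your program; a bash sketch:

```shell
# Set the thread count, then verify it from the same shell session.
export OMP_NUM_THREADS=2
echo "OMP_NUM_THREADS is now $OMP_NUM_THREADS"
```

Any program launched from this shell afterward inherits the variable.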
Re-run driver.exe, first using 1 and then using 3 OpenMP threads. Notice the speedup when running with multiple threads. Also notice that although the result, Final rms norm, is essentially the same in both runs, the values differ slightly after around the 11th digit. The full explanation is beyond the scope of this tutorial, but in short this results from a combination of floating-point roundoff errors and differences in the order of arithmetic operations. The punch line is that bitwise-identical results between runs are difficult to achieve in parallel computations, and in any case may not be necessary in the first place.
To run an OpenMP-enabled batch job, the steps are identical to those required for requesting an exclusive node, except that we must additionally set the environment variable OMP_NUM_THREADS. It is recommended that this variable be set inside the batch job submission file to ensure reproducibility of results.
Create a batch job submission file:
#!/bin/bash
#SBATCH -J test1 # job name
#SBATCH -o test1.txt # output/error file name
#SBATCH -p workshop # requested queue
#SBATCH --exclusive # do not share the compute node
#SBATCH -t 1 # maximum runtime in minutes
# set the desired number of OpenMP threads
export OMP_NUM_THREADS=7
# run the code
./driver.exe
Recall, the --exclusive option indicates that we wish to run the job on an entire node (without sharing that node with others). This is critical for SMP jobs: each SMP job launches multiple threads of execution, and we do not want to intrude on other users by running threads on their CPU cores!
Furthermore, note that once the job is launched, it will use 7 of the 8 available CPU cores on that node, leaving one core idle.
Note
In fact, each worker node does much more than just run your job (runs the operating system, handles network traffic, etc.), so in many instances SMP jobs run faster when using \(N-1\) threads than when using \(N\) threads, where \(N\) is the number of CPU cores, since this leaves one core to handle all remaining non-job duties.
Compile the program driver.exe using the GNU compiler with OpenMP enabled.
Create a single SLURM submission script that will run the program driver.exe using 1, 2, 3, …, 12 OpenMP threads on ManeFrame II's parallel partition. Recall from session 5 that you may embed multiple commands within your job submission script.
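One possible structure for such a script loops over the thread counts within a single job (the job name, output file, and time limit here are placeholders; adjust them as needed):

```shell
#!/bin/bash
#SBATCH -J scaling          # job name (placeholder)
#SBATCH -o scaling.txt      # output/error file name
#SBATCH -p parallel         # requested queue
#SBATCH --exclusive         # do not share the compute node
#SBATCH -t 10               # maximum runtime in minutes

# Run the same executable with 1 through 12 OpenMP threads,
# labeling each run so the timings can be compared afterward.
for n in $(seq 1 12); do
    export OMP_NUM_THREADS=$n
    echo "=== OMP_NUM_THREADS = $n ==="
    ./driver.exe
done
```

Because all runs share one job and one exclusive node, the timings are directly comparable.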
Launch this job, and when it has completed, determine the parallel efficiency (i.e. strong scaling performance) of this code (defined in session 6, parallel_computing_metrics). How well does the program perform? Is there a maximum number of threads where, beyond which, additional resources no longer improve the speed?
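As a reminder, for \(N\) threads with runtimes \(T_1\) (serial) and \(T_N\), the speedup and parallel efficiency are

\(S(N) = \frac{T_1}{T_N}, \qquad E(N) = \frac{S(N)}{N}.\)

For example, if the 1-thread run takes 8 s and the 4-thread run takes 2.5 s, then \(S(4) = 8/2.5 = 3.2\) and \(E(4) = 3.2/4 = 0.8\), i.e. 80% efficiency; perfect strong scaling corresponds to \(E(N) = 1\).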
Note
If you finish this early, perform the same experiment but this time using the PGI compiler. How do your results differ?