Using ManeFrame

What is ManeFrame?

ManeFrame is a Linux cluster that can run jobs spanning multiple processors at once (a.k.a. high performance computing or parallel computing) and/or large numbers of simultaneous jobs (a.k.a. high throughput computing).

  • ManeFrame – The newest member of HPC at SMU, ManeFrame was acquired in 2014 through an award by the US Department of Defense HPC Modernization Program to a group of faculty from across the SMU campus.

    While ManeFrame may be used for both high-throughput and parallel computing, it is built with a high-speed, low-latency communication network that connects all computing nodes, making it ideally suited for parallel computing. As a result, ManeFrame uses the SLURM job scheduler, which was designed from scratch for large-scale parallel computing (and is used by many of the most powerful supercomputers in the world).

ManeFrame hardware

First, let’s familiarize ourselves with the hardware that comprises the ManeFrame cluster. We can group the portions of ManeFrame into a few simple categories: login nodes, worker nodes and disk nodes.

Login nodes

Typically, users only directly interact with the “login” nodes on ManeFrame. These are the nodes that users log into when accessing ManeFrame, and are where users request other resources for running jobs.

  • The main login nodes are mflogin01.hpc.smu.edu and mflogin02.hpc.smu.edu. These are used for compiling source code, setting up job submission files and input files, transferring data to/from ManeFrame, etc. These nodes should not be used for running computationally intensive calculations.
  • Additionally, mflogin03.hpc.smu.edu and mflogin04.hpc.smu.edu are not currently available, but will be provisioned in the near future.
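
For example, to connect to one of the login nodes from a terminal (replace <username> with your SMU user name):

$ ssh <username>@mflogin01.hpc.smu.edu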

Worker nodes

The worker, or compute, nodes are where jobs should be run on the ManeFrame cluster:

  • The worker nodes are mfc0001.hpc.smu.edu through mfc1104.hpc.smu.edu:
    • 20 of these are “high memory” nodes, with 192 GB of RAM each. Access to these nodes is obtained by requesting a specific queue.
    • 1084 of these are “normal” nodes, with 24 GB of RAM each. Access to these nodes is likewise obtained by requesting a specific queue.
  • All nodes have 8 cores (Intel(R) Xeon(R) X5560 processors @ 2.80 GHz).
  • All nodes are connected by a 20Gbps DDR InfiniBand connection to the core backbone InfiniBand network.
  • Unless reserved for an interactive job, users cannot directly log in to the worker nodes.

Note

Of the 1104 compute nodes, not all are currently running in production. Administrative efforts are underway to fix the remaining hardware issues and bring those nodes online as they become available.

Disk nodes

All ManeFrame compute and login nodes may access a set of network filesystems; these filesystems reside on separate, dedicated disk nodes.

  • HOME: Home directories on ManeFrame reside on an NFS file system, located in /users (e.g. /users/rkalescky). When you log in to ManeFrame, your home directory is where you will land by default.

    This space should be used to write, edit, compile, and browse your programs and job submission scripts. You can use programs in your HOME to test them interactively on an interactive node obtained from the job scheduler, as described below.

    Note

    $HOME space is restricted by quotas because of limited space (2.5 TB total). Due to these storage constraints, this space is intended to preserve your important programs and job submission scripts. You should not use this space to store large data/input files that can be reproduced or downloaded again. Please refrain from moving large data files to/from your home directory; instead, move them to /scratch.

  • SCRATCH: The largest pool of storage on ManeFrame is located in /scratch/users/; all users have a dedicated directory there (e.g. /scratch/users/rkalescky). ManeFrame has approximately 1.2 PB of scratch space, which uses a high-performance parallel Lustre file system.

    This space should serve as your default location for storage of large files or for large numbers of files used in computation.

    Additionally, all users have a directory named /scratch/users/<username>/_small. This directory corresponds with a smaller (~250 TB) high-speed scratch filesystem. Performance-wise, the SATA disks comprising /scratch/users/<username> operate at 400 MB/s, whereas the SAS disks comprising /scratch/users/<username>/_small operate at 450 MB/s.

    As a high-performance parallel filesystem, Lustre will not perform well if misused.

    Note

    SCRATCH is a volatile file system, meaning we do not guarantee that any of the files stored in SCRATCH can be retrieved or restored in the event of an accidental delete, loss or failure of the filesystem. Users are therefore encouraged to save their programs, job submission scripts and other non-reproducible files in $HOME or any other secondary storage system.

  • NFSSCRATCH: ManeFrame additionally has a set of “fast” storage, located in /nfsscratch/users/. These SSD drives provide approximately 2.2 TB of storage and use a high-performance NFS file system. Use of this storage space requires approval from the Director of the Center for Scientific Computation, Dr. Thomas Hagstrom.

    Note

    Due to the size and premium nature of this file system, users are required to clean up the storage space automatically after every job finishes, by bundling and moving the resulting files as part of the job’s ‘epilog’ process.

  • LOCAL_TEMP: ManeFrame’s worker nodes may also access a relatively large amount of local temporary space for use during the execution of a job, located in /local_temp/users/. For example, the Gaussian application periodically dumps files of 100-400 GB while a job executes.

  • SOFTWARE: All ManeFrame nodes may access a shared NFS disk that holds software, located in /grid/software. A typical user will never need to browse this directly, as the module system modifies environment variables to point at these installations automatically.

Users are encouraged to contact smuhpc-admins@smu.edu with questions regarding selecting the appropriate storage for their jobs.
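
As a quick orientation, the following commands illustrate one way to check your HOME usage and to keep large files in SCRATCH; the lfs utility assumes the Lustre client tools are available on the login nodes, and the archive name is only a placeholder:

$ du -sh $HOME                                # total size of your home directory
$ lfs df -h /scratch                          # free space on the Lustre scratch filesystem (if lfs is installed)
$ mv big_dataset.tar.gz /scratch/users/$USER/ # keep large, reproducible data in SCRATCH rather than HOME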

General information

  • OS: Scientific Linux 6 (64 bit)
  • Scheduler: SLURM
  • The software stack on ManeFrame includes a variety of high performance mathematics and software libraries, as well as the GNU and PGI compiler suites. A full listing is always available with the module avail command (see the example after this list).
  • The ManeFrame wiki page (requires SMU login) has more detailed information on the hardware and software configuration of the cluster.
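
For example, a typical session with the module system might look like the following; the module name loaded here is only illustrative, and the actual names on ManeFrame are whatever module avail reports:

$ module avail      # list all software modules installed on ManeFrame
$ module load pgi   # load a compiler suite (module name is illustrative)
$ module list       # show the modules currently loaded in your environment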

Getting Along with LUSTRE

Presentation

The SLURM Job Scheduler

In general, a job scheduler is a program that manages unattended background program execution (a.k.a. batch processing). The basic features of any job scheduler include:

  • Interfaces which help to define workflows and/or job dependencies.
  • Automatic submission of executions.
  • Interfaces to monitor the executions.
  • Priorities and/or queues to control the execution order of unrelated jobs.

In the context of high-throughput and high-performance computing, the primary role of a job scheduler is to manage the job queue for all of the compute nodes of the cluster. Its goal is typically to schedule queued jobs so that all of the compute nodes are utilized to capacity, while doing so in a fair manner that gives priority to users who have used fewer resources and/or contributed more to the acquisition of the system.

Some widely used cluster batch systems are:

  • Condor – this is used on the older SMUHPC cluster
  • SLURM – this is used on ManeFrame, and is described in the remainder of this section

ManeFrame’s SLURM Queues or Partitions

There are currently eight partitions, also known as queues, available on ManeFrame. These queues are designed to accommodate various usage scenarios based on a calculation’s expected duration, its degree of parallelization, and its memory requirements, with the goal of allowing fair access to computational resources for all users. The queues are described in the following table.

Queue Name        Shared or Exclusive   Duration   Number of Nodes
development       Shared                2 hours    15
serial            Shared                24 hours   332
parallel-short    Exclusive             24 hours   384
parallel-medium   Exclusive             7 days     129
parallel-long     Exclusive             30 days    92
highmem-short     Exclusive             24 hours   10
highmem-medium    Exclusive             7 days     5
highmem-long      Exclusive             30 days    5

Types of Queues Available

development
The development queue is for program development, testing, and running compute-intensive interactive applications. Each node in this queue may be shared with other users unless exclusive access to the node is requested. The allocation can be kept for up to two hours.
serial
The serial queue is for single-core jobs and is particularly suited to high-throughput scenarios. As this queue is meant for single-core jobs, multiple allocations may be assigned to a single node. Allocations using this queue can be kept for up to 24 hours.
parallel
The parallel queue is for multi-core and multi-node-multi-core parallel jobs, e.g. when using OpenMP or MPI based programs. This type of queue is subdivided into short-, medium-, and long-running queues with allocations for up to 24 hours, 7 days, and 30 days respectively. All allocations are given exclusive access to the nodes.
highmem
The highmem queue is for using the nodes with 192 GB of RAM. This type of queue is subdivided into short-, medium-, and long-running queues with allocations for up to 24 hours, 7 days, and 30 days respectively. The nodes can also be used for parallel applications, and all allocations are given exclusive access to the nodes.

SLURM Commands

While there are a multitude of SLURM commands, here we’ll focus on those applicable to running batch and interactive jobs:

  • sinfo – displays information about SLURM nodes and partitions (queue types). A full list of options is available here. The usage command (with the most-helpful optional arguments in brackets) is

    $ sinfo [-a] [-l] [-n <nodes>] [-p <partition>] [-s]
    

    where these options are:

    • -a or --all – Display information about all partitions
    • -l or --long – Displays more detailed information
    • -n <nodes> or --nodes <nodes> – Displays information only about the specified node(s). Multiple nodes may be comma separated or expressed using a node range expression. For example mfc[1005-1007].hpc.smu.edu would indicate three nodes, mfc1005.hpc.smu.edu through mfc1007.hpc.smu.edu.
    • -p <partition> or --partition <partition> – Displays information only about the specified partition
    • -s or --summarize – List only a partition state summary with no node state details.

    Examples:

    $ sinfo --long -p highmem  # long output for all nodes allocated to the "highmem" partition
    $ sinfo -s                 # summarizes output on all nodes on all partitions
    
  • squeue – views information about jobs located in the SLURM scheduling queue. A full list of options is available here. The usage command (with the most-helpful optional arguments in brackets) is

    $ squeue [-a] [-j <job_id_list>] [-l] [-p <part_list>] [--start] [-u <user_list>]
    

    where these options are:

    • -a or --all – Display information about jobs and job steps in all partitions.
    • -j <job_id_list> or --jobs <job_id_list> – Requests a comma separated list of job ids to display. Defaults to all jobs.
    • -l or --long – Reports more of the available information for the selected jobs or job steps, subject to any constraints specified.
    • -p <part_list> or --partition <part_list> – Specifies the partitions of the jobs or steps to view. Accepts a comma separated list of partition names.
    • --start – Reports the expected start time of pending jobs, in order of increasing start time.
    • -u <user_list> or --user <user_list> – Requests jobs or job steps from a comma separated list of users. The list can consist of user names or user id numbers.

    Examples:

    $ squeue                            # all jobs
    $ squeue -u rkalescky --start       # anticipated start time of rkalescky's jobs
    $ squeue --jobs 12345,12346,12348   # information on only jobs 12345, 12346 and 12348
    
  • sbatch – submits a batch script to SLURM. A full list of options is available here. The usage command is

    $ sbatch [options] <script> [args]
    

    where <script> is a batch submission script, and [args] are any optional arguments that should be supplied to <script>. The sbatch command accepts a multitude of options; these options may be supplied either at the command-line or inside the batch submission script.

    It is recommended that all options be specified inside the batch submission file, to ensure reproducibility of results (i.e. so that the same options are specified on each run, and no options are accidentally left out). Any command-line sbatch option may equivalently be specified within this script (at the top, before any executable commands), preceded by the text #SBATCH.

    These options are discussed in the following section, Batch job submission file.

    Examples:

    $ sbatch ./myscript.sh    # submits the batch submission file "myscript.sh" to SLURM
    
  • srun – runs a parallel or interactive job on the worker nodes. A full list of options is available here. The usage command (with the most-helpful optional arguments in brackets) is

    $ srun [-D <path>] [-e <errf>] [--epilog=<executable>] [-I] [-o <outf>] [-p <part>] [-N <num>] [-n <num>] [--pty] [-t <min>] [--x11] <executable>
    

    where these options are:

    • -D <path> or --chdir=<path> – have the remote processes change directory to <path> before beginning execution. The default is to change to the current working directory of the srun process.
    • -e <errf> or --error=<errf> – redirects stderr to the file <errf>
    • --epilog=<executable> – run executable just after the job completes. The command line arguments for executable will be the command and arguments of the job itself. If executable is “none”, then no epilog will be run.
    • -I or --immediate[=secs] – exit if requested resources not available in “secs” seconds (useful for interactive jobs).
    • -o <outf> or --output=<outf> – redirects stdout to the file <outf>
    • -p <part> or --partition=<part> – requests that the job be run on the requested partition.
    • -N <num> or --nodes=<num> – requests that the job be run using <num> nodes. Primarily useful for running parallel jobs
    • -n <num> or --ntasks=<num> – requests that the job be run using <num> tasks. The default is one task per node. Primarily useful for running parallel jobs
    • --pty – requests that the task be run in a pseudo-terminal
    • -t <min> or --time=<min> – sets a limit on the total run time of the job. The default/maximum time limit is defined on a per-partition basis.
    • --x11=[batch|first|last|all] – exports the X11 display from the first|last|all allocated node(s), so that graphics displayed by this process can be forwarded to your screen.
    • <executable> – the actual program to run.

    Examples:

    $ srun -p parallel /bin/program # runs executable /bin/program on "parallel" partition
    $ srun --x11=first --pty emacs  # runs "emacs" and forwards graphics
    $ srun --x11=first --pty $SHELL # runs the user's current shell and forwards graphics
    
  • salloc – obtains a SLURM job allocation (a set of nodes), executes a command, and then releases the allocation when the command is finished. A full list of options is available here. The usage command is

    $ salloc [options] <command> [command args]
    

    where <command> [command args] specifies the command (and any arguments) to run. Available options are almost identical to srun, including:

    • -D <path> or --chdir=<path> – change directory to <path> before beginning execution.
    • -I or --immediate[=secs] – exit if requested resources not available in “secs” seconds (useful for interactive jobs).
    • -p <part> or --partition=<part> – requests that the job be run on the requested partition.
    • -t <min> or --time=<min> – sets a limit on the total run time of the job. The default/maximum time limit is defined on a per-partition basis.
    • --x11=[batch|first|last|all] – exports the X11 display from the first|last|all allocated node(s), so that graphics displayed by this process can be forwarded to your screen.
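
    Examples (the partition names follow the table above, and the driver script name is only illustrative):

    $ salloc -p development -t 120 $SHELL     # interactive shell with a two-hour allocation on the "development" partition
    $ salloc -N 2 -p parallel ./run_tests.sh  # allocate two nodes, run a (hypothetical) driver script, then release them
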
  • scancel – kills jobs or job steps that are under the control of SLURM (and listed by squeue). A full list of options is available here. The usage command (with the most-helpful optional arguments in brackets) is

    $ scancel [-i] [-n <job_name>] [-p <part>] [-t <state>] [-u <uname>] [jobid]
    

    where these options are:

    • -i or --interactive – require response from user for each job (used when cancelling multiple jobs at once)
    • -n <job_name> or --name=<job_name> – cancel only on jobs with the specified name.
    • -p <part> or --partition=<part> – cancel only on jobs in the specified partition.
    • -t <state> or --state=<state> – cancel only on jobs in the specified state. Valid job states are PENDING, RUNNING and SUSPENDED
    • -u <uname> or --user=<uname> – cancel only on jobs of the specified user (note: normal users can only cancel their own jobs).
    • jobid is the numeric job identifier (as shown by squeue) of the job to cancel.

    Examples:

    $ scancel 1234  # cancel job number 1234
    $ scancel -u rkalescky  # cancel all jobs owned by user "rkalescky"
    $ scancel -t PENDING -u joe  # cancel all pending jobs owned by user "joe"
    

Example: Running Interactive Jobs

In this example, we’ll interactively run the Python script myjob.py, which implements a simple algorithm for approximating \(\pi\): a composite trapezoidal numerical integration formula applied to

\[\int_0^1 \frac{4}{1+x^2}\,\mathrm dx\]

This script accepts a single integer-valued command-line argument, corresponding to the number of subintervals to use in the approximation, with the typical tradeoff that the harder you work, the better your answer.
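
Since \(\int_0^1 \frac{4}{1+x^2}\,\mathrm dx = \pi\), the composite trapezoidal rule on \(n\) subintervals of width \(h = 1/n\), with \(x_i = ih\) and \(f(x) = \frac{4}{1+x^2}\), computes the approximation

\[\pi \approx \frac{h}{2}\left[f(x_0) + 2\sum_{i=1}^{n-1} f(x_i) + f(x_n)\right]\]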

While you can run this at the command line:

$ python ./myjob.py 50

as we increase the number of subintervals to obtain a more accurate approximation, the script takes longer to run, so as “good citizens” we should run it on dedicated compute nodes instead of the shared login nodes.

Before running this script on a compute node, we need to ensure that myjob.py has “executable” permissions:

$ chmod +x ./myjob.py

We’ll use srun to run this script interactively for interval values of {50,500,5000,50000}. For each run, we’ll direct the output to a separate file:

$ srun -o run_50.txt ./myjob.py 50
$ srun -o run_500.txt ./myjob.py 500
$ srun -o run_5000.txt ./myjob.py 5000
$ srun -o run_50000.txt ./myjob.py 50000

Upon completion you should have the files run_50.txt, run_500.txt, run_5000.txt and run_50000.txt in your directory. View the results to ensure that things ran properly:

$ cat run_*

Note

In the above commands we do not need to explicitly request the “interactive” SLURM partition, since that is the default.

Batch Job Submission File

The standard way that a user submits batch jobs to run on SLURM is through creating a job submission file that describes (and executes) the job you want to run. This is the <script> file specified to the sbatch command.

A batch submission script is just that, a shell script. You are welcome to use your preferred shell scripting language; the example shown here uses BASH. Thus, the script typically starts with the line

#!/bin/bash

The following lines (before any executable commands) contain the options to be supplied to the sbatch command. Each of these options must be prepended with the text #SBATCH, e.g.

#!/bin/bash
#SBATCH -J my_program       # job name to display in squeue
#SBATCH -o output-%j.txt    # standard output file
#SBATCH -e error-%j.txt     # standard error file
#SBATCH -p parallel         # requested partition
#SBATCH -t 180              # maximum runtime in minutes

Since each of these sbatch options begins with the character #, they are treated as comments by the BASH shell; however sbatch parses the file to find these and supply them as options for the job.

After all of the requested options have been specified, you can supply any number of executable lines, variable definitions, and even functions, as with any other BASH script.
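
For instance, a submission script might look like the following after its #SBATCH header; the program and input file names here are purely illustrative:

#!/bin/bash
#SBATCH -J demo             # job name
#SBATCH -o demo-%j.txt      # output file

# ordinary BASH follows the #SBATCH header
NRUNS=3
for i in $(seq 1 $NRUNS); do
   echo "starting run $i on $(hostname)"
   ./myprogram input_$i.dat   # hypothetical executable and input files
done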

Unlike general BASH scripts, there are a few SLURM replacement symbols (variables) that may be used within your script:

  • %A – the master job allocation number (only meaningful for job arrays; advanced usage)
  • %a – the job array ID (index) number (also only meaningful for job arrays)
  • %j – the job allocation number (the number listed by squeue)
  • %N – the node name. If running a job on multiple nodes, this will map to only the first node of the job (i.e. the node that actually runs the script).
  • %u – your username
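
For example, an output file name pattern combining some of these symbols (the expanded values shown in the comment are only illustrative):

#SBATCH -o output_%j_%N.txt   # e.g. output_12345_mfc0101.txt for job 12345 run from node mfc0101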

The available options to sbatch are numerous. Here we list the most useful options for running serial batch jobs.

  • -D <dir> or --workdir=<dir> – sets the working directory where the batch script should be run, e.g.

    #SBATCH -D /scratch/users/ezekiel/test_run
    
  • -J <name> or --job-name=<name> – sets the job name as output by the squeue command, e.g.

    #SBATCH -J test_job
    
  • -o <fname> – sets the output file name for stdout and stderr (if stderr is left unspecified). The default standard output is directed to a file of the name slurm-%j.out, where %j corresponds to the job ID number. You can do something similar, e.g.

    #SBATCH -o output-%j.txt
    
  • -e <fname> – sets the output file name for stderr only. The default is to combine this with stdout. An example similar to -o above would be

    #SBATCH -e error-%j.txt
    
  • -i <fname> or --input=<fname> – sets the standard input stream for the running job. For example, if an executable program will prompt the user for text input, these inputs may be placed in a file inputs.txt and specified to the script via

    #SBATCH -i inputs.txt
    
  • -p <part> – tells SLURM which partition the job should be submitted to. The options are “interactive”, “highmem” or “parallel”. For example, to submit a batch job to a high-memory node you would use

    #SBATCH -p highmem
    
  • -t <num> – tells SLURM the maximum runtime to be allowed for the job (in minutes). For example, to allow a job to run for up to 3 hours you would use

    #SBATCH -t 180
    
  • --exclusive – tells SLURM that the job can not share nodes with other running jobs.

    #SBATCH --exclusive
    
  • -s or --share – tells SLURM that the job can share nodes with other running jobs. This is the opposite of --exclusive; whichever option is seen last on the command line will be used. This option may result in the allocation being granted sooner than if --share were not set, and allows higher system utilization, but application performance will likely suffer due to competition for resources within a node.

    #SBATCH -s
    

    Note

    Of the three ManeFrame partitions, job-based shared/exclusive control is only available for parallel and highmem; the interactive queue forces “shared” usage, with up to four shared jobs per node.

  • --mail-user <email address> – tells SLURM your email address if you’d like to receive job-related email notifications, e.g.

    #SBATCH --mail-user peruna@smu.edu
    
  • --mail-type=<flag> – tells SLURM which types of email notification messages you wish to receive. Options include:

    • BEGIN – send a message when the run starts
    • END – send a message when the run ends
    • FAIL – send a message if the run failed for some reason
    • REQUEUE – send a message if and when the job is requeued
    • ALL – send a message for all of the above

    For example,

    #SBATCH --mail-type=all
    

Running Batch Jobs

We’ll run the script myjob.sh (described below) in three ways:

  1. first, we run myjob.sh one time, potentially sharing computing resources on the worker node,
  2. second, we run myjob.sh multiple times from a single job submission, requesting that the node not be shared with others,
  3. third, we set up a suite of jobs to run and submit the set to run simultaneously on the worker nodes.
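
These examples assume a helper script, myjob.sh, that takes a single integer argument N and prints the first N prime numbers, one per line. If you do not already have such a script from the workshop materials, the following minimal sketch (which may differ from the "official" myjob.sh) will do; remember to give it execute permissions with chmod +x ./myjob.sh:

#!/bin/bash
# myjob.sh -- print the first N prime numbers, one per line (illustrative sketch)
N=${1:?usage: myjob.sh <number of primes>}
count=0
candidate=2
while [ "$count" -lt "$N" ]; do
   is_prime=1
   for (( d=2; d*d<=candidate; d++ )); do
      if (( candidate % d == 0 )); then
         is_prime=0
         break
      fi
   done
   if (( is_prime )); then
      echo "$candidate"
      (( count++ ))
   fi
   (( candidate++ ))
done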

Example: Running a Single Job on a Shared Node

Create the following job submission file and save it as test1.job:

#!/bin/bash
#SBATCH -J myjob          # job name
#SBATCH -o test1.txt      # output/error file
#SBATCH -p parallel       # requested queue
#SBATCH -t 1              # maximum runtime in minutes

# run for first 100 primes
./myjob.sh 100

Submit this job to SLURM using the sbatch command:

$ sbatch ./test1.job

After this returns, you can monitor the progress of your job in the queue via the squeue command, e.g.

$ squeue -u $USER

Note

In the above command, the environment variable USER is evaluated by the shell, limiting the output to only your own jobs.

When your job completes, you should have a new file, test1.txt in your directory, containing the output from running the job. To verify that it computed the first 100 primes, you can check the length of the file, e.g.

$ wc -l test1.txt

Of course, you could also go through the file line-by-line ensuring that each value is indeed a prime number.
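
As a quick spot check, the coreutils factor utility (if installed) will confirm that a given value is prime; for example, to check the last number in the output:

$ factor $(tail -n 1 test1.txt)   # a prime is reported with itself as its only factor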

Example: Running a Suite of Tests in a Single Job (Non-shared Node)

Suppose instead that you wish to run many short-running tests, and want to run these back-to-back after waiting only once for your job to make it through the queue. Since the SLURM submission file is just a shell script, you can run many tests inside the same submission. Save the following as test2.job:

#!/bin/bash
#SBATCH -J myjob2         # job name
#SBATCH -o test2.txt      # output/error file name
#SBATCH -p parallel       # requested queue
#SBATCH --exclusive       # do not share the compute node
#SBATCH -t 10             # maximum runtime in minutes

# first run for 200 primes, placing output in run_200.txt, and timing run
echo "  "
echo "running for 200 primes"
time -p ./myjob.sh 200 > run_200.txt

# run again for 2000 primes,
echo "  "
echo "running for 2000 primes"
time -p ./myjob.sh 2000 > run_2000.txt

# run again for 20000 primes,
echo "  "
echo "running for 20000 primes"
time -p ./myjob.sh 20000 > run_20000.txt

Again, submit this job to SLURM using the sbatch command:

$ sbatch ./test2.job

After this returns, you can monitor the progress of your job in the queue via the squeue command, e.g.

$ squeue -u $USER

This job will take significantly longer to complete, since we not only run myjob.sh three times, but each run computes significantly more primes than before.

When your job completes, you should have four new files, test2.txt that contains the run timing information for each test, run_200.txt that contains the first 200 primes, run_2000.txt that contains the first 2000 primes, and run_20000.txt that contains the first 20000 prime numbers. You can check the length of these files again using wc, e.g.

$ wc -l run*.txt

Note

Investigate the timing output in the file test2.txt. Note that as the requested number of primes increases by a factor of 10, the required run time increases by significantly more than a factor of 10. These results help measure the complexity of this algorithm.

Example: Running a Suite of Tests Simultaneously

Now suppose again that you wish to run a large number of tests, but these tests may take somewhat longer, and you want them to run simultaneously on separate worker nodes. To do this in SLURM, one approach is to write many submission files and submit each job separately. While creating multiple files that differ only minimally can be tedious by hand, a shell script can do this with ease. Save the following script as runtests.sh:

#!/bin/bash

# set up array of test sizes to run
NPRIMES=(150 1500 15000)

# iterate over this array, setting up and submitting
# separate job files for each test
for n in "${NPRIMES[@]}"
do

   JOBFILE=test_$n.job                 # set job file name
   echo "#!/bin/bash" > $JOBFILE       # create job file

   # append sbatch options into file header
   echo "#SBATCH -J job_$n" >> $JOBFILE
   echo "#SBATCH -o test_$n.txt" >> $JOBFILE
   echo "#SBATCH -p parallel" >> $JOBFILE
   echo "#SBATCH -t 10" >> $JOBFILE

   # append test execution commands into job file
   echo "time -p ./myjob.sh $n > run_$n.txt" >> $JOBFILE

   # submit job to queue
   sbatch $JOBFILE

done

# check the queue to see that all jobs were submitted
squeue -u $USER

In order to run this shell script we need to give it “execute” permissions,

$ chmod +x ./runtests.sh

With this in place, we need only run this one shell script to set up and launch our jobs:

$ ./runtests.sh

Upon running this script, your new jobs are in the SLURM queue and can execute concurrently. Moreover, you have reusable batch submission scripts for each run in case something goes awry with one of the runs.
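
If one of the runs does go awry, you can cancel just that job by name, using the job_$n naming convention from the script above; for example:

$ scancel -n job_15000   # cancel only the 15000-prime test (leave the others running)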