4. OpenMPI
Now that we have an operational IB fabric, we can let software make use of it. The most widely used application of RDMA over IB is message passing through an MPI library.
An MPI library has to be compiled and customized for each cluster. Since it is an integral part of a cluster software stack, any other software that is compiled with MPI support will have to be linked against one specific version.
This is why on our clusters we usually only have one MPI implementation installed (OpenMPI). Supercomputing centers with dedicated application staff will support multiple MPI implementations, such as OpenMPI, Intel MPI, MPICH, etc., and may compile software for all of them.
4.1. Compilation
The MPI library and the batch system have to work closely together: each batch job allocates a certain set of resources, and the MPI library has to be able to determine where to launch its processes. We're going to install OpenMPI and compile it with InfiniBand and Slurm support.
First we have to make sure the Slurm development headers are installed on our master. This is one of the RPMs we’ve created in Exercise 8: Creating a Batch System with Slurm:
# ensure slurm development headers are installed
[root@master ~]# yum install slurm-devel
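If you are not sure whether the headers are already present, you can list the package contents; the slurm-devel RPM typically places them under /usr/include/slurm/ (the exact path depends on how the RPM was built in Exercise 8):

[root@master ~]# rpm -ql slurm-devel | grep slurm.h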
Switch to the install user and download the OpenMPI source code from GitHub:

[install@master ~]$ VERSION=4.1.6
[install@master ~]$ git clone https://github.com/open-mpi/ompi.git
[install@master ~]$ cd ompi
[install@master ompi]$ git checkout v$VERSION
The OpenMPI package is then configured with several options and with an install prefix inside your software module folder:
# generate configure script
[install@master ompi]$ ./autogen.pl
[install@master ompi]$ OPTIONS="--with-pic --with-slurm --with-verbs"
[install@master ompi]$ INSTALL_PREFIX=/data/opt/base/openmpi-$VERSION
[install@master ompi]$ ./configure $OPTIONS --prefix=$INSTALL_PREFIX
--with-pic
Compile with position independent code
--with-slurm
Use Slurm launcher instead of SSH
--with-verbs
Use low-level InfiniBand interconnect (libverbs library)
Verify that the configuration completes with Slurm and OpenFabrics Verbs enabled:
Open MPI configuration:
-----------------------
Version: 4.1.6rc4
Build MPI C bindings: yes
Build MPI C++ bindings (deprecated): no
Build MPI Fortran bindings: no
MPI Build Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
CUDA support: no
HWLOC support: internal
Libevent support: internal
Open UCC: no
PMIx support: Internal

Transports
-----------------------
Cisco usNIC: no
Cray uGNI (Gemini/Aries): no
Intel Omnipath (PSM2): no
Intel TrueScale (PSM): no
Mellanox MXM: no
Open UCX: no
OpenFabrics OFI Libfabric: no
OpenFabrics Verbs: yes
Portals4: no
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Resource Managers
-----------------------
Cray Alps: no
Grid Engine: no
LSF: no
Moab: no
Slurm: yes
ssh/rsh: yes
Torque: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no
Lustre: no
PVFS2/OrangeFS: no
Compile and install OpenMPI:

[install@master ompi]$ make -j 12
[install@master ompi]$ make install
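Before moving on, a quick sanity check does not hurt: the freshly installed binaries should run, and ompi_info should list the openib byte transfer layer that --with-verbs enabled. Run this from the same shell so $INSTALL_PREFIX is still set:

[install@master ompi]$ $INSTALL_PREFIX/bin/mpirun --version
[install@master ompi]$ $INSTALL_PREFIX/bin/ompi_info | grep "btl: openib"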
4.2. Software Module for OpenMPI
Now that OpenMPI is installed in the shared directory, we still need to define a software module file for it:
Create the folders for the module:
[install@master ompi]$ mkdir -p /data/opt/base/modulefiles/mpi/openmpi
Create the software module file /data/opt/base/modulefiles/mpi/openmpi/4.1.6.lua:

-- -*- lua -*-
local version = "4.1.6"
local prefix  = "/data/opt/base/openmpi-" .. version

family("mpi")
add_property("lmod","sticky")

whatis([[OpenMPI message passing interface package]])
help([===[
This module enables using message passing interface libraries of
the OpenMPI distribution. The environment variables $PATH,
$LD_LIBRARY_PATH, and $MANPATH are adjusted accordingly. This
version includes support for MPI threads.

The following environment variables are defined for use in
Makefiles: $MPI_HOME, $MPI_BIN, $MPI_INC, $MPI_LIB,
$MPI_FORTRAN_MOD_DIR
]===])

prepend_path("PATH",            prefix .. "/bin")
prepend_path("LD_LIBRARY_PATH", prefix .. "/lib")
prepend_path("MANPATH",         prefix .. "/share/man")

setenv("MPI_BIN",             prefix .. "/bin")
setenv("MPI_SYSCONFIG",       prefix .. "/etc")
setenv("MPI_FORTRAN_MOD_DIR", prefix .. "/lib")
setenv("MPI_INC",             prefix .. "/include")
setenv("MPI_LIB",             prefix .. "/lib")
setenv("MPI_MAN",             prefix .. "/share/man")
setenv("MPI_HOME",            prefix)
This will allow you to load the mpi/openmpi module.
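As a quick test, load the module in a fresh shell and check that the compiler wrapper resolves to the new prefix (the path shown assumes the install prefix used above):

[testuser@master ~]$ module load mpi/openmpi
[testuser@master ~]$ which mpicc
/data/opt/base/openmpi-4.1.6/bin/mpicc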
5. Installation as a default software module
Some software packages should always be loaded on a cluster. One of them is the default MPI implementation. We want it to be loaded automatically when a user logs into the cluster.
Create a file /data/opt/base/modulefiles/StdEnv.lua. Inside of this file we will load modules that should be available by default:

load("mpi/openmpi")
Add files in /etc/profile.d/ on the master and all your compute nodes

To automatically load StdEnv, we will need to add another file in /etc/profile.d/. To make sure it is loaded after module support is added, call it /etc/profile.d/z01_StdEnv.sh:

if [ -z "$__Init_Default_Modules" ]; then
  export __Init_Default_Modules=1;
  ## ability to predefine elsewhere the default list
  LMOD_SYSTEM_DEFAULT_MODULES=${LMOD_SYSTEM_DEFAULT_MODULES:-"StdEnv"}
  export LMOD_SYSTEM_DEFAULT_MODULES
  module --initial_load --no_redirect restore
else
  module refresh
fi
For other shells, you can define similar scripts (e.g. z01_StdEnv.csh):

if ( ! $?__Init_Default_Modules ) then
  setenv __Init_Default_Modules 1
  if ( ! $?LMOD_SYSTEM_DEFAULT_MODULES ) then
    setenv LMOD_SYSTEM_DEFAULT_MODULES "StdEnv"
  endif
  module --initial_load restore
else
  module refresh
endif
Note
Make sure there is a copy of /etc/profile.d/z01_StdEnv.sh on all of your compute nodes!
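How you get the file there depends on your provisioning setup. A minimal sketch, assuming passwordless root SSH and hypothetical node names node01 through node04:

[root@master ~]# for n in node0{1..4}; do scp /etc/profile.d/z01_StdEnv.sh $n:/etc/profile.d/; done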
Once these files are in place, open a new shell as a user. You will notice that module list is no longer empty, but contains the loaded MPI module:
[install@master ~]$ module list
Currently Loaded Modules:
1) mpi/openmpi/4.1.6 (S) 2) StdEnv
Where:
S: Module is Sticky, requires --force to unload or purge
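With the module loaded by default, you can run a small end-to-end test. The sketch below uses the hello_c.c example that ships in the OpenMPI source tree we checked out earlier (adjust the path to wherever your checkout lives and make sure it is readable):

[testuser@master ~]$ mpicc ~install/ompi/examples/hello_c.c -o hello_mpi
[testuser@master ~]$ salloc -N 2 mpirun ./hello_mpi

Each rank should print a "Hello, world" line with its rank number.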
6. OpenMPI and RDMA configuration
There are some system configurations that have to be applied to your master and compute nodes to enable the use of RDMA by OpenMPI-enabled applications.
6.1. Give regular users access to pinned memory
Applications that use RDMA will require some physical memory to be pinned. That
means the kernel is not allowed to swap out these memory pages. This allows
IB network devices to directly write into the memory without involving the CPU.
Pinning memory is a privileged operation. This restriction has to be lifted so regular users are able to request pinned memory for RDMA applications. You can either define a specific size limit for users, or set it to unlimited. This configuration is controlled by files in /etc/security/limits.d/.
Create a file /etc/security/limits.d/rdma.conf with the following contents on both the master and all of your compute nodes.
* soft memlock unlimited
* hard memlock unlimited
You can verify that these settings have an effect by checking the limits with ulimit as a regular user:
[testuser@master ~]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 47275
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
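The limit has to be in effect on the compute nodes as well, not just on the master. One way to spot-check this through the batch system (note that Slurm by default propagates the submitting shell's limits to the job, so test from a fresh login after creating the file):

[testuser@master ~]$ srun -N 2 bash -c 'ulimit -l'
unlimited
unlimited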
6.2. Make OpenMPI use InfiniBand by default
OpenMPI supports different network technologies for message passing. It will try a set of technologies, which is controlled by the btl configuration option. You can globally define this and other options in $INSTALL_PREFIX/etc/openmpi-mca-params.conf.
Add the following configuration to /data/opt/base/openmpi-4.1.6/etc/openmpi-mca-params.conf:
# restrict the transports to InfiniBand, shared memory, and self (no TCP fallback)
btl = openib,vader,self
btl_openib_allow_ib = 1
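To confirm that the openib transport is actually picked at run time, you can raise the verbosity of the btl framework for a single run; a sketch, reusing the hello_mpi binary from the earlier test:

[testuser@master ~]$ salloc -N 2 mpirun --mca btl_base_verbose 30 ./hello_mpi 2>&1 | grep openib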