4. OpenMPI

Now that we have an operational IB fabric, we can let software make use of it. The most widely used application of RDMA over IB is message passing through an MPI library.

An MPI library has to be compiled and customized for each cluster. Since it is an integral part of the cluster software stack, any other software that is compiled with MPI support will have to be linked against one specific version.

This is why our clusters usually have only one MPI implementation installed (OpenMPI). Supercomputing centers with dedicated application staff will support multiple MPI implementations, such as OpenMPI, Intel MPI, MPICH, etc., and may compile software for all of them.

4.1. Compilation

The MPI library and the batch system have to work closely together: each batch job allocates a certain set of resources, and the MPI library has to be able to determine where to launch its processes. We’re going to install OpenMPI and compile it with InfiniBand and Slurm support:

  1. First we have to make sure the Slurm development headers are installed on our master. This is one of the RPMs we’ve created in Exercise 8: Creating a Batch System with Slurm:

    # ensure the Slurm development headers are installed
    [root@master ~]# yum install slurm-devel
    
  2. Switch to the install user and download the OpenMPI source code from GitHub:

    [install@master ~]$ VERSION=4.1.6
    [install@master ~]$ git clone https://github.com/open-mpi/ompi.git
    [install@master ~]$ cd ompi
    [install@master ompi]$ git checkout v$VERSION
    
  3. The OpenMPI package is then configured with several options and with an install prefix inside your software module folder:

    # generate configure script
    [install@master ompi]$ ./autogen.pl
    [install@master ompi]$ OPTIONS="--with-pic --with-slurm --with-verbs"
    [install@master ompi]$ INSTALL_PREFIX=/data/opt/base/openmpi-$VERSION
    [install@master ompi]$ ./configure $OPTIONS --prefix=$INSTALL_PREFIX
    
    --with-pic

    Compile with position independent code

    --with-slurm

    Use Slurm launcher instead of SSH

    --with-verbs

    Use low-level InfiniBand interconnect (libverbs library)

  4. Verify that the configuration completes with Slurm and OpenFabrics Verbs enabled:

    Open MPI configuration:
    -----------------------
    Version: 4.1.6rc4
    Build MPI C bindings: yes
    Build MPI C++ bindings (deprecated): no
    Build MPI Fortran bindings: no
    MPI Build Java bindings (experimental): no
    Build Open SHMEM support: false (no spml)
    Debug build: no
    Platform file: (none)
    
    Miscellaneous
    -----------------------
    CUDA support: no
    HWLOC support: internal
    Libevent support: internal
    Open UCC: no
    PMIx support: Internal
    
    Transports
    -----------------------
    Cisco usNIC: no
    Cray uGNI (Gemini/Aries): no
    Intel Omnipath (PSM2): no
    Intel TrueScale (PSM): no
    Mellanox MXM: no
    Open UCX: no
    OpenFabrics OFI Libfabric: no
    OpenFabrics Verbs: yes
    Portals4: no
    Shared memory/copy in+copy out: yes
    Shared memory/Linux CMA: yes
    Shared memory/Linux KNEM: no
    Shared memory/XPMEM: no
    TCP: yes
    
    Resource Managers
    -----------------------
    Cray Alps: no
    Grid Engine: no
    LSF: no
    Moab: no
    Slurm: yes
    ssh/rsh: yes
    Torque: no
    
    OMPIO File Systems
    -----------------------
    DDN Infinite Memory Engine: no
    Generic Unix FS: yes
    IBM Spectrum Scale/GPFS: no
    Lustre: no
    PVFS2/OrangeFS: no
    
  5. Compile and install OpenMPI:

    [install@master ompi]$ make -j 12
    [install@master ompi]$ make install
    

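After make install completes, a quick sanity check is to ask the freshly installed ompi_info which MCA components were compiled in and filter for the Slurm and verbs plugins (a sketch; the grep filter is just one way to inspect the component list):

# list the compiled-in MCA components and filter for Slurm and verbs support
[install@master ompi]$ /data/opt/base/openmpi-4.1.6/bin/ompi_info | grep -i -e slurm -e openib
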
4.2. Software Module for OpenMPI

Now that OpenMPI is installed in the shared directory, we still need to define a software module file for it:

  1. Create the folders for the module

    [install@master ompi]$ mkdir -p /data/opt/base/modulefiles/mpi/openmpi
    
  2. Create the software module file /data/opt/base/modulefiles/mpi/openmpi/4.1.6.lua

    -- -*- lua -*-
    
    local version = "4.1.6"
    local prefix = "/data/opt/base/openmpi-" .. version
    
    family("mpi")
    add_property("lmod","sticky")
    
    whatis([[OpenMPI message passing interface package]])
    
    help([===[
    This module enables using message passing interface libraries
    of the OpenMPI distribution. The environment variables $PATH,
    $LD_LIBRARY_PATH, and $MANPATH are adjusted accordingly.
    This version includes support for MPI threads.
    
    The following environment variables are defined for use in Makefiles:
    $MPI_DIR, $MPI_BIN, $MPI_INC, $MPI_LIB, $MPI_FORTRAN_MOD_DIR
    ]===])
    
    prepend_path("PATH", prefix .. "/bin")
    prepend_path("LD_LIBRARY_PATH", prefix .. "/lib")
    prepend_path("MANPATH", prefix .. "/share/man")
    setenv("MPI_BIN", prefix .. "/bin")
    setenv("MPI_SYSCONFIG", prefix .. "/etc")
    setenv("MPI_FORTRAN_MOD_DIR", prefix .. "/lib")
    setenv("MPI_INC", prefix .. "/include")
    setenv("MPI_LIB", prefix .. "/lib")
    setenv("MPI_MAN", prefix .. "/share/man")
    setenv("MPI_HOME", prefix)
    

This will allow you to load the mpi/openmpi module.
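
You can quickly test the module by loading it and checking that mpirun and the environment variables resolve to the new installation (assuming /data/opt/base/modulefiles is already part of your MODULEPATH):

[install@master ~]$ module load mpi/openmpi
[install@master ~]$ which mpirun
/data/opt/base/openmpi-4.1.6/bin/mpirun
[install@master ~]$ echo $MPI_HOME
/data/opt/base/openmpi-4.1.6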

5. Installation as default software module

Some software packages should always be loaded on a cluster. One of them is the default MPI implementation. We want it to be loaded automatically when a user logs into the cluster.

  1. Create a file /data/opt/base/modulefiles/StdEnv.lua. Inside this file we load the modules that should be available by default:

    load("mpi/openmpi")
    
  2. Add files in /etc/profile.d/ on the master and all your compute nodes

    To automatically load StdEnv, we will need to add another file in /etc/profile.d/. To make sure it is loaded after module support is added, call it /etc/profile.d/z01_StdEnv.sh:

    if [ -z "$__Init_Default_Modules" ]; then
       export __Init_Default_Modules=1;
    
       ## ability to predefine elsewhere the default list
       LMOD_SYSTEM_DEFAULT_MODULES=${LMOD_SYSTEM_DEFAULT_MODULES:-"StdEnv"}
       export LMOD_SYSTEM_DEFAULT_MODULES
       module --initial_load --no_redirect restore
    else
       module refresh
    fi
    

    For other shells, you can define similar scripts (e.g. z01_StdEnv.csh for csh/tcsh):

    if ( ! $?__Init_Default_Modules )  then
      setenv __Init_Default_Modules 1
      if ( ! $?LMOD_SYSTEM_DEFAULT_MODULES ) then
        setenv LMOD_SYSTEM_DEFAULT_MODULES "StdEnv"
      endif
      module --initial_load restore
    else
      module refresh
    endif
    

Note

Make sure there is a copy of /etc/profile.d/z01_StdEnv.sh on all of your compute nodes!
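
One way to distribute the script is a simple loop over your nodes (a sketch; the node names node01 and node02 are placeholders for your cluster, and a parallel shell or your provisioning system works just as well):

[root@master ~]# for node in node01 node02; do scp /etc/profile.d/z01_StdEnv.sh ${node}:/etc/profile.d/; done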

Once these files are in place, open a new shell as a user. You will notice that the output of module list is no longer empty but contains the loaded MPI module:

[install@master ~]$ module list

Currently Loaded Modules:
  1) mpi/openmpi/4.1.6 (S)   2) StdEnv

  Where:
   S:  Module is Sticky, requires --force to unload or purge
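
With the default module in place you can run a first MPI test. The following is a minimal sketch (the file name hello_mpi.c, the here-document, and the node and task counts are arbitrary choices for illustration); it compiles a small hello-world with mpicc and launches it through mpirun inside a Slurm allocation:

[testuser@master ~]$ cat > hello_mpi.c <<'EOF'
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
EOF
[testuser@master ~]$ mpicc -o hello_mpi hello_mpi.c
[testuser@master ~]$ salloc -N 2 --ntasks-per-node=2 mpirun ./hello_mpi

Each rank should print one line, with the ranks spread across the allocated compute nodes.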

6. OpenMPI and RDMA configuration

A few system settings have to be applied to your master and compute nodes to enable the use of RDMA by OpenMPI-enabled applications.

6.1. Give regular users access to pinned memory

Applications that use RDMA require some physical memory to be pinned, which means the kernel is not allowed to swap out these memory pages. This allows IB network devices to write directly into the memory without involving the CPU. Pinning memory is normally a privileged operation, so this restriction has to be lifted for regular users to be able to request pinned memory for RDMA applications. You can either define a specific size limit for users or set it to unlimited. This configuration is controlled by files in /etc/security/limits.d/.

Create a file /etc/security/limits.d/rdma.conf with the following contents on both the master and all of your compute nodes:

* soft memlock unlimited
* hard memlock unlimited

You can verify that these settings take effect by checking the limits with ulimit as a regular user:

[testuser@master ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 47275
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
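
Since MPI processes are launched on the compute nodes through Slurm, it is worth checking the limit inside a job as well (a sketch; whether limits are propagated to the job depends on your Slurm configuration, e.g. the PropagateResourceLimits setting):

# check the locked-memory limit as seen by a process launched on a compute node
[testuser@master ~]$ srun -N 1 bash -c 'ulimit -l'

This should report unlimited once the limits file is in place on the compute nodes.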

6.2. Make OpenMPI use InfiniBand by default

OpenMPI supports different network technologies for message passing. Which transports it tries is controlled by the btl (byte transfer layer) configuration option. You can globally define this and other options in $INSTALL_PREFIX/etc/openmpi-mca-params.conf.

Add the following configuration to /data/opt/base/openmpi-4.1.6/etc/openmpi-mca-params.conf:

# prefer the InfiniBand (openib) transport, plus shared memory and self, by default
btl = openib,vader,self
btl_openib_allow_ib = 1
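
These values are only defaults and can be overridden per run with the --mca option of mpirun. The following sketch forces a run over TCP instead, which is handy for comparing the Ethernet and InfiniBand paths (it reuses the hypothetical hello_mpi binary from the earlier test):

# temporarily override the default BTL list for a single run
[testuser@master ~]$ salloc -N 2 --ntasks-per-node=1 mpirun --mca btl tcp,vader,self ./hello_mpi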