7. Testing InfiniBand with LAMMPS

To demonstrate how to compile and run an MPI application, we'll do a test with the LAMMPS molecular dynamics code. It's a widely used simulation engine for studying processes at the atomic scale, and it runs on the largest supercomputers in the world.

7.1. Installing LAMMPS into user home

  1. Download the LAMMPS source code from GitHub

    [testuser@master ~]$ git clone -b patch_10Feb2021 --depth 1 https://github.com/lammps/lammps.git
    [testuser@master ~]$ cd lammps
    
  2. Configure LAMMPS with CMake

    Note

    This will require the CMake module we’ve installed in Exercise 6: Software modules

    [testuser@master lammps]$ module load cmake
    [testuser@master lammps]$ mkdir build
    [testuser@master lammps]$ cd build
    [testuser@master build]$ cmake -C ../cmake/presets/minimal.cmake -D CMAKE_INSTALL_PREFIX=$HOME/opt/lammps ../cmake
    
  3. Compile LAMMPS and install it into your user’s home directory

    [testuser@master build]$ make -j 12
    [testuser@master build]$ make install
    
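Before setting up environment variables, it can be worth a quick sanity check that the install landed under the prefix passed to CMAKE_INSTALL_PREFIX. A minimal sketch; the helper name check_lammps_install is just for illustration:

```shell
#!/bin/bash
# Sanity check: was an executable lmp binary installed under the prefix?
# The function name is illustrative, not part of LAMMPS itself.
check_lammps_install() {
    local prefix="$1"
    if [ -x "$prefix/bin/lmp" ]; then
        echo "ok: $prefix/bin/lmp"
    else
        echo "missing: no executable lmp under $prefix/bin" >&2
        return 1
    fi
}

check_lammps_install "$HOME/opt/lammps" || true
```

If the check fails, re-run make install and verify the prefix used in the configure step.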

7.2. Running LAMMPS

LAMMPS is now ready to be used, but we will need to set some environment variables to make sure everything can be found.

export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials

Note

Typically we would install LAMMPS as a module, but for our quick tests we'll use it like this.
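Rather than retyping these exports in every shell, one option until LAMMPS is packaged as a module is to keep them in a small file that job scripts and interactive shells can source. A sketch, with lammps-env.sh as an assumed filename:

```shell
#!/bin/bash
# Write the LAMMPS environment settings to a file that can be
# sourced from job scripts and interactive shells alike.
cat > "$HOME/lammps-env.sh" <<'EOF'
export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials
EOF

# Load the settings into the current shell and confirm the result.
source "$HOME/lammps-env.sh"
echo "$LAMMPS_POTENTIALS"
```

Afterwards, a single `source ~/lammps-env.sh` sets everything up.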

Verify that LAMMPS works as expected:

  1. Set the necessary environment variables mentioned above

  2. Launch the in.melt example from lammps/examples/melt in serial mode

    [testuser@master ~]$ cd lammps/examples/melt
    [testuser@master melt]$ lmp -in in.melt
    LAMMPS (10 Feb 2021)
    OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
      using 1 OpenMP thread(s) per MPI task
    Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
    Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (16.795962 16.795962 16.795962)
      1 by 1 by 1 MPI processor grid
    Created 4000 atoms
      create_atoms CPU = 0.001 seconds
    Neighbor list info ...
      update every 20 steps, delay 0 steps, check no
      max neighbors/atom: 2000, page size: 100000
      master list distance cutoff = 2.8
      ghost atom cutoff = 2.8
      binsize = 1.4, bins = 12 12 12
      1 neighbor lists, perpetual/occasional/extra = 1 0 0
      (1) pair lj/cut, perpetual
          attributes: half, newton on
          pair build: half/bin/atomonly/newton
          stencil: half/bin/3d/newton
          bin: standard
    Setting up Verlet run ...
      Unit style    : lj
      Current step  : 0
      Time step     : 0.005
    Per MPI rank memory allocation (min/avg/max) = 3.222 | 3.222 | 3.222 Mbytes
    Step Temp E_pair E_mol TotEng Press
           0            3   -6.7733681            0   -2.2744931   -3.7033504
          50    1.6758903   -4.7955425            0   -2.2823355     5.670064
         100    1.6458363   -4.7492704            0   -2.2811332    5.8691042
         150    1.6324555   -4.7286791            0    -2.280608    5.9589514
         200    1.6630725   -4.7750988            0   -2.2811136    5.7364886
         250    1.6275257   -4.7224992            0    -2.281821    5.9567365
    Loop time of 0.959769 on 1 procs for 250 steps with 4000 atoms
    
    Performance: 112527.020 tau/day, 260.479 timesteps/s
    100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
    
    MPI task timing breakdown:
    Section |  min time  |  avg time  |  max time  |%varavg| %total
    ---------------------------------------------------------------
    Pair    | 0.82686    | 0.82686    | 0.82686    |   0.0 | 86.15
    Neigh   | 0.098545   | 0.098545   | 0.098545   |   0.0 | 10.27
    Comm    | 0.014394   | 0.014394   | 0.014394   |   0.0 |  1.50
    Output  | 0.00025191 | 0.00025191 | 0.00025191 |   0.0 |  0.03
    Modify  | 0.016137   | 0.016137   | 0.016137   |   0.0 |  1.68
    Other   |            | 0.003585   |            |       |  0.37
    
    Nlocal:        4000.00 ave        4000 max        4000 min
    Histogram: 1 0 0 0 0 0 0 0 0 0
    Nghost:        5499.00 ave        5499 max        5499 min
    Histogram: 1 0 0 0 0 0 0 0 0 0
    Neighs:        151513.0 ave      151513 max      151513 min
    Histogram: 1 0 0 0 0 0 0 0 0 0
    
    Total # of neighbors = 151513
    Ave neighs/atom = 37.878250
    Neighbor list builds = 12
    Dangerous builds not checked
    Total wall time: 0:00:00
    
  3. Try running LAMMPS with MPI

    [testuser@master melt]$ mpirun -np 4 lmp -in in.melt
    LAMMPS (10 Feb 2021)
    OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
      using 1 OpenMP thread(s) per MPI task
    Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
    Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (16.795962 16.795962 16.795962)
      1 by 2 by 2 MPI processor grid
    Created 4000 atoms
      create_atoms CPU = 0.001 seconds
    Neighbor list info ...
      update every 20 steps, delay 0 steps, check no
      max neighbors/atom: 2000, page size: 100000
      master list distance cutoff = 2.8
      ghost atom cutoff = 2.8
      binsize = 1.4, bins = 12 12 12
      1 neighbor lists, perpetual/occasional/extra = 1 0 0
      (1) pair lj/cut, perpetual
          attributes: half, newton on
          pair build: half/bin/atomonly/newton
          stencil: half/bin/3d/newton
          bin: standard
    Setting up Verlet run ...
      Unit style    : lj
      Current step  : 0
      Time step     : 0.005
    Per MPI rank memory allocation (min/avg/max) = 2.706 | 2.706 | 2.706 Mbytes
    Step Temp E_pair E_mol TotEng Press
           0            3   -6.7733681            0   -2.2744931   -3.7033504
          50    1.6754119   -4.7947589            0   -2.2822693    5.6615925
         100    1.6503357    -4.756014            0   -2.2811293    5.8050524
         150    1.6596605   -4.7699432            0   -2.2810749    5.7830138
         200    1.6371874   -4.7365462            0   -2.2813789    5.9246674
         250    1.6323462   -4.7292021            0   -2.2812949    5.9762238
    Loop time of 0.295548 on 4 procs for 250 steps with 4000 atoms
    
    Performance: 365422.843 tau/day, 845.886 timesteps/s
    95.5% CPU use with 4 MPI tasks x 1 OpenMP threads
    
    MPI task timing breakdown:
    Section |  min time  |  avg time  |  max time  |%varavg| %total
    ---------------------------------------------------------------
    Pair    | 0.21616    | 0.22506    | 0.23363    |   1.5 | 76.15
    Neigh   | 0.02635    | 0.02702    | 0.027748   |   0.4 |  9.14
    Comm    | 0.026872   | 0.036185   | 0.045796   |   4.2 | 12.24
    Output  | 0.00010878 | 0.00012555 | 0.00017277 |   0.0 |  0.04
    Modify  | 0.0042559  | 0.0043301  | 0.0044012  |   0.1 |  1.47
    Other   |            | 0.002825   |            |       |  0.96
    
    Nlocal:        1000.00 ave        1010 max         982 min
    Histogram: 1 0 0 0 0 0 1 0 0 2
    Nghost:        2703.75 ave        2713 max        2689 min
    Histogram: 1 0 0 0 0 0 0 2 0 1
    Neighs:        37915.5 ave       39239 max       36193 min
    Histogram: 1 0 0 0 0 1 1 0 0 1
    
    Total # of neighbors = 151662
    Ave neighs/atom = 37.915500
    Neighbor list builds = 12
    Dangerous builds not checked
    Total wall time: 0:00:00
    

7.3. Running LAMMPS on our cluster

We will now launch a series of LAMMPS jobs on our cluster to test it. The LAMMPS input script we will use is in.lj, found in the lammps/bench/ folder. It's a simple benchmark with a grid of 32,000 atoms. You can adjust the size of the simulation by passing the variables x, y and z to the input script; they multiply the grid in each dimension. E.g., with x=2, y=2, z=2 the grid doubles in each dimension and has 8 times the number of atoms (256,000). You can use this benchmark to look at strong and weak scaling.
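The two melt runs in the previous section already give one strong-scaling data point. A sketch of the standard speedup and efficiency arithmetic, using the Loop time values printed above:

```shell
#!/bin/bash
# Strong-scaling metrics from two "Loop time" values:
#   speedup    = t_serial / t_parallel
#   efficiency = speedup / number_of_ranks
t_serial=0.959769    # 1 MPI rank, from the serial melt run above
t_parallel=0.295548  # 4 MPI ranks, from the mpirun -np 4 run above
ranks=4

awk -v ts="$t_serial" -v tp="$t_parallel" -v n="$ranks" 'BEGIN {
    speedup = ts / tp
    printf "speedup:    %.2fx\n", speedup
    printf "efficiency: %.0f%%\n", 100 * speedup / n
}'
```

For this pair of runs the speedup comes out around 3.25x, i.e. roughly 81% parallel efficiency; the Comm share of the timing breakdown explains most of the loss.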

Here is a Slurm job script for running a LAMMPS benchmark on a single node

Note

Notice that we don’t need to specify the number of processors to mpirun. Due to the integration with Slurm, mpirun will know how many MPI ranks to launch and where.

#!/bin/bash
#SBATCH -J lammps_test
#SBATCH -N 1
#SBATCH -n 12
#SBATCH -c 1
#SBATCH -t 10
#SBATCH -o %x-%j.out

export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials

cd $HOME/lammps/bench

# run with 32,000 atoms
mpirun lmp -in in.lj -v x 1 -v y 1 -v z 1
# run with 256,000 atoms
# mpirun lmp -in in.lj -v x 2 -v y 2 -v z 2

You can easily modify it to run on multiple nodes by changing the lines

#SBATCH -N 1
#SBATCH -n 12

Run some tests, modifying both N (the number of nodes) and n (the number of tasks), and increase the number of atoms by changing the x, y and z parameters. Look at how the Loop time changes to observe how LAMMPS scales.
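When comparing several runs, it helps to pull the Loop time lines out of the job output files instead of scrolling through each log. A sketch using awk; the sample output file here is fabricated from a line printed in the earlier 4-rank run, and the filename pattern matches the %x-%j.out convention in the job script:

```shell
#!/bin/bash
# Fabricate a sample job output file containing the summary line that
# LAMMPS prints at the end of a run (copied from the 4-rank melt run).
cat > lammps_test-42.out <<'EOF'
Loop time of 0.295548 on 4 procs for 250 steps with 4000 atoms
EOF

# Report wall time and rank count for every matching output file.
for f in lammps_test-*.out; do
    awk -v file="$f" '/^Loop time/ {
        printf "%s: %s s on %s procs\n", file, $4, $6
    }' "$f"
done
```

With real job outputs in place, the loop prints one line per run, which is enough to tabulate Loop time against node and task counts.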

Note

If you want to test the effect InfiniBand has on the simulation time, you can try running without it. To do this, pass the following additional parameters to mpirun:

mpirun --mca btl '^openib' lmp -in in.lj -v x 1 -v y 1 -v z 1