7. Testing InfiniBand with LAMMPS

To demonstrate how to compile and run an MPI application, we'll do a test with the LAMMPS molecular dynamics code. It's a widely used simulation engine for studying processes at the atomic scale, and it runs on the largest supercomputers in the world.

7.1. Installing LAMMPS into user home

  1. Download the LAMMPS source code from GitHub

    [testuser@master ~]$ git clone -b patch_10Feb2021 --depth 1 https://github.com/lammps/lammps.git
    [testuser@master ~]$ cd lammps
    
  2. Configure LAMMPS with CMake

    Note

    This will require the CMake module we’ve installed in Exercise 6: Software modules

    [testuser@master lammps]$ module load cmake
    [testuser@master lammps]$ mkdir build
    [testuser@master lammps]$ cd build
    [testuser@master build]$ cmake -C ../cmake/presets/minimal.cmake -D CMAKE_INSTALL_PREFIX=$HOME/opt/lammps ../cmake
    
  3. Compile LAMMPS and install it into your user’s home directory

    [testuser@master build]$ make -j 12
    [testuser@master build]$ make install
    
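Before setting up environment variables, it can be worth a quick sanity check that the install landed under the prefix passed to CMAKE_INSTALL_PREFIX. A minimal sketch; the helper name check_lammps_install is just for illustration:

```shell
#!/bin/bash
# Sanity check: was an executable lmp binary installed under the prefix?
# The function name is illustrative, not part of LAMMPS itself.
check_lammps_install() {
    local prefix="$1"
    if [ -x "$prefix/bin/lmp" ]; then
        echo "ok: $prefix/bin/lmp"
    else
        echo "missing: no executable lmp under $prefix/bin" >&2
        return 1
    fi
}

check_lammps_install "$HOME/opt/lammps" || true
```

If the check fails, re-run make install and verify the prefix used in the configure step.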

7.2. Running LAMMPS

LAMMPS is now ready to be used, but we will need to set some environment variables to make sure everything can be found.

export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials

Note

Typically we would install LAMMPS as a module, but for our quick tests we'll use it like this.
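Rather than retyping these exports in every shell, one option until LAMMPS is packaged as a module is to keep them in a small file that job scripts and interactive shells can source. A sketch, with lammps-env.sh as an assumed filename:

```shell
#!/bin/bash
# Write the LAMMPS environment settings to a file that can be
# sourced from job scripts and interactive shells alike.
cat > "$HOME/lammps-env.sh" <<'EOF'
export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials
EOF

# Load the settings into the current shell and confirm the result.
source "$HOME/lammps-env.sh"
echo "$LAMMPS_POTENTIALS"
```

Afterwards, a single `source ~/lammps-env.sh` sets everything up.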

Verify that LAMMPS works as expected:

  1. Set the necessary environment variables mentioned above

  2. Launch the in.melt example from lammps/examples/melt in serial mode

    [testuser@master ~]$ cd lammps/examples/melt
    [testuser@master melt]$ lmp -in in.melt
    LAMMPS (10 Feb 2021)
    OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
      using 1 OpenMP thread(s) per MPI task
    Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
    Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (16.795962 16.795962 16.795962)
      1 by 1 by 1 MPI processor grid
    Created 4000 atoms
      create_atoms CPU = 0.001 seconds
    Neighbor list info ...
      update every 20 steps, delay 0 steps, check no
      max neighbors/atom: 2000, page size: 100000
      master list distance cutoff = 2.8
      ghost atom cutoff = 2.8
      binsize = 1.4, bins = 12 12 12
      1 neighbor lists, perpetual/occasional/extra = 1 0 0
      (1) pair lj/cut, perpetual
          attributes: half, newton on
          pair build: half/bin/atomonly/newton
          stencil: half/bin/3d/newton
          bin: standard
    Setting up Verlet run ...
      Unit style    : lj
      Current step  : 0
      Time step     : 0.005
    Per MPI rank memory allocation (min/avg/max) = 3.222 | 3.222 | 3.222 Mbytes
    Step Temp E_pair E_mol TotEng Press
           0            3   -6.7733681            0   -2.2744931   -3.7033504
          50    1.6758903   -4.7955425            0   -2.2823355     5.670064
         100    1.6458363   -4.7492704            0   -2.2811332    5.8691042
         150    1.6324555   -4.7286791            0    -2.280608    5.9589514
         200    1.6630725   -4.7750988            0   -2.2811136    5.7364886
         250    1.6275257   -4.7224992            0    -2.281821    5.9567365
    Loop time of 0.959769 on 1 procs for 250 steps with 4000 atoms
    
    Performance: 112527.020 tau/day, 260.479 timesteps/s
    100.0% CPU use with 1 MPI tasks x 1 OpenMP threads
    
    MPI task timing breakdown:
    Section |  min time  |  avg time  |  max time  |%varavg| %total
    ---------------------------------------------------------------
    Pair    | 0.82686    | 0.82686    | 0.82686    |   0.0 | 86.15
    Neigh   | 0.098545   | 0.098545   | 0.098545   |   0.0 | 10.27
    Comm    | 0.014394   | 0.014394   | 0.014394   |   0.0 |  1.50
    Output  | 0.00025191 | 0.00025191 | 0.00025191 |   0.0 |  0.03
    Modify  | 0.016137   | 0.016137   | 0.016137   |   0.0 |  1.68
    Other   |            | 0.003585   |            |       |  0.37
    
    Nlocal:        4000.00 ave        4000 max        4000 min
    Histogram: 1 0 0 0 0 0 0 0 0 0
    Nghost:        5499.00 ave        5499 max        5499 min
    Histogram: 1 0 0 0 0 0 0 0 0 0
    Neighs:        151513.0 ave      151513 max      151513 min
    Histogram: 1 0 0 0 0 0 0 0 0 0
    
    Total # of neighbors = 151513
    Ave neighs/atom = 37.878250
    Neighbor list builds = 12
    Dangerous builds not checked
    Total wall time: 0:00:00
    
  3. Try running LAMMPS with MPI

    [testuser@master melt]$ mpirun -np 4 lmp -in in.melt
    LAMMPS (10 Feb 2021)
    OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
      using 1 OpenMP thread(s) per MPI task
    Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
    Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (16.795962 16.795962 16.795962)
      1 by 2 by 2 MPI processor grid
    Created 4000 atoms
      create_atoms CPU = 0.001 seconds
    Neighbor list info ...
      update every 20 steps, delay 0 steps, check no
      max neighbors/atom: 2000, page size: 100000
      master list distance cutoff = 2.8
      ghost atom cutoff = 2.8
      binsize = 1.4, bins = 12 12 12
      1 neighbor lists, perpetual/occasional/extra = 1 0 0
      (1) pair lj/cut, perpetual
          attributes: half, newton on
          pair build: half/bin/atomonly/newton
          stencil: half/bin/3d/newton
          bin: standard
    Setting up Verlet run ...
      Unit style    : lj
      Current step  : 0
      Time step     : 0.005
    Per MPI rank memory allocation (min/avg/max) = 2.706 | 2.706 | 2.706 Mbytes
    Step Temp E_pair E_mol TotEng Press
           0            3   -6.7733681            0   -2.2744931   -3.7033504
          50    1.6754119   -4.7947589            0   -2.2822693    5.6615925
         100    1.6503357    -4.756014            0   -2.2811293    5.8050524
         150    1.6596605   -4.7699432            0   -2.2810749    5.7830138
         200    1.6371874   -4.7365462            0   -2.2813789    5.9246674
         250    1.6323462   -4.7292021            0   -2.2812949    5.9762238
    Loop time of 0.295548 on 4 procs for 250 steps with 4000 atoms
    
    Performance: 365422.843 tau/day, 845.886 timesteps/s
    95.5% CPU use with 4 MPI tasks x 1 OpenMP threads
    
    MPI task timing breakdown:
    Section |  min time  |  avg time  |  max time  |%varavg| %total
    ---------------------------------------------------------------
    Pair    | 0.21616    | 0.22506    | 0.23363    |   1.5 | 76.15
    Neigh   | 0.02635    | 0.02702    | 0.027748   |   0.4 |  9.14
    Comm    | 0.026872   | 0.036185   | 0.045796   |   4.2 | 12.24
    Output  | 0.00010878 | 0.00012555 | 0.00017277 |   0.0 |  0.04
    Modify  | 0.0042559  | 0.0043301  | 0.0044012  |   0.1 |  1.47
    Other   |            | 0.002825   |            |       |  0.96
    
    Nlocal:        1000.00 ave        1010 max         982 min
    Histogram: 1 0 0 0 0 0 1 0 0 2
    Nghost:        2703.75 ave        2713 max        2689 min
    Histogram: 1 0 0 0 0 0 0 2 0 1
    Neighs:        37915.5 ave       39239 max       36193 min
    Histogram: 1 0 0 0 0 1 1 0 0 1
    
    Total # of neighbors = 151662
    Ave neighs/atom = 37.915500
    Neighbor list builds = 12
    Dangerous builds not checked
    Total wall time: 0:00:00
    

7.3. Running LAMMPS on our cluster

We will now launch a series of LAMMPS jobs on our cluster to test it. The LAMMPS input script we will use is in.lj, found in the lammps/bench/ folder. It's a simple benchmark with a grid of 32,000 atoms. You can adjust the size of the simulation by passing the variables x, y and z to the input script; they multiply the grid in each dimension. E.g., with x=2, y=2, z=2 the grid doubles in each dimension and has 8 times the number of atoms (256,000). You can use this benchmark to look at strong and weak scaling.
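The two melt runs in the previous section already give one strong-scaling data point. A sketch of the standard speedup and efficiency arithmetic, using the Loop time values printed above:

```shell
#!/bin/bash
# Strong-scaling metrics from two "Loop time" values:
#   speedup    = t_serial / t_parallel
#   efficiency = speedup / number_of_ranks
t_serial=0.959769    # 1 MPI rank, from the serial melt run above
t_parallel=0.295548  # 4 MPI ranks, from the mpirun -np 4 run above
ranks=4

awk -v ts="$t_serial" -v tp="$t_parallel" -v n="$ranks" 'BEGIN {
    speedup = ts / tp
    printf "speedup:    %.2fx\n", speedup
    printf "efficiency: %.0f%%\n", 100 * speedup / n
}'
```

For this pair of runs the speedup comes out around 3.25x, i.e. roughly 81% parallel efficiency; the Comm share of the timing breakdown explains most of the loss.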

Here is a Slurm job script for running a LAMMPS benchmark on a single node

Note

Notice that we don’t need to specify the number of processors to mpirun. Due to the integration with Slurm, mpirun will know how many MPI ranks to launch and where.

#!/bin/bash
#SBATCH -J lammps_test
#SBATCH -N 1
#SBATCH -n 12
#SBATCH -c 1
#SBATCH -t 10
#SBATCH -o %x-%j.out

export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials

cd $HOME/lammps/bench

# run with 32,000 atoms
mpirun lmp -in in.lj -v x 1 -v y 1 -v z 1
# run with 256,000 atoms
# mpirun lmp -in in.lj -v x 2 -v y 2 -v z 2

You can easily modify it to run on multiple nodes by changing the lines

#SBATCH -N 1
#SBATCH -n 12

Run some tests, modifying both N (the number of nodes) and n (the number of tasks), and increase the number of atoms by changing the x, y and z parameters. Look at how the Loop time changes to observe how LAMMPS scales.
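When comparing several runs, it helps to pull the Loop time lines out of the job output files instead of scrolling through each log. A sketch using awk; the sample output file here is fabricated from a line printed in the earlier 4-rank run, and the filename pattern matches the %x-%j.out convention in the job script:

```shell
#!/bin/bash
# Fabricate a sample job output file containing the summary line that
# LAMMPS prints at the end of a run (copied from the 4-rank melt run).
cat > lammps_test-42.out <<'EOF'
Loop time of 0.295548 on 4 procs for 250 steps with 4000 atoms
EOF

# Report wall time and rank count for every matching output file.
for f in lammps_test-*.out; do
    awk -v file="$f" '/^Loop time/ {
        printf "%s: %s s on %s procs\n", file, $4, $6
    }' "$f"
done
```

With real job outputs in place, the loop prints one line per run, which is enough to tabulate Loop time against node and task counts.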

Note

If you want to test the effect InfiniBand has on the simulation time, you can try running without it. To do this, pass the following additional parameters to mpirun:

mpirun --mca btl '^openib' lmp -in in.lj -v x 1 -v y 1 -v z 1