7. Testing InfiniBand with LAMMPS
To demonstrate how to compile and run an MPI application, we’ll test with the LAMMPS molecular dynamics code. It is a widely used simulation engine for studying processes at the atomic scale and runs on the largest supercomputers in the world.
7.1. Installing LAMMPS into user home
Download the LAMMPS source code from GitHub
[testuser@master ~]$ git clone -b patch_10Feb2021 --depth 1 https://github.com/lammps/lammps.git
[testuser@master ~]$ cd lammps
Configure LAMMPS with CMake
Note
This will require the CMake module we’ve installed in Exercise 6: Software modules
[testuser@master lammps]$ module load cmake
[testuser@master lammps]$ mkdir build
[testuser@master lammps]$ cd build
[testuser@master build]$ cmake -C ../cmake/presets/minimal.cmake -D CMAKE_INSTALL_PREFIX=$HOME/opt/lammps ../cmake
Compile LAMMPS and install it into your user’s home directory
[testuser@master build]$ make -j 12
[testuser@master build]$ make install
7.2. Running LAMMPS
LAMMPS is now ready to be used, but we need to set some environment variables to make sure everything can be found.
export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials
Note
Typically we would install LAMMPS as a module, but for our quick tests we’ll use it like this.
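The exports above have to be repeated in every new shell. As a minimal sketch, they could be wrapped in a small function that also checks whether the lmp binary can be found afterwards. The function name lammps_env is hypothetical and not part of LAMMPS; the paths are the ones used in this exercise.

```shell
# Hypothetical helper (not part of LAMMPS): set up the environment
# for a LAMMPS install prefix, defaulting to the one used in this exercise.
lammps_env() {
    export LAMMPS_DIR="${1:-$HOME/opt/lammps}"
    export PATH="$LAMMPS_DIR/bin:$PATH"
    # ${VAR:+...} avoids a trailing ':' when LD_LIBRARY_PATH is unset
    export LD_LIBRARY_PATH="$LAMMPS_DIR/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LAMMPS_POTENTIALS="$LAMMPS_DIR/share/lammps/potentials"
}

lammps_env

# Report whether the lmp binary can now be found.
if command -v lmp >/dev/null 2>&1; then
    echo "lmp found at $(command -v lmp)"
else
    echo "lmp not found in PATH; check $LAMMPS_DIR/bin"
fi
```

Saved to a file, this could be loaded with the source command at the start of a session or job script.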
Verify that LAMMPS works as expected:
Set the necessary environment variables mentioned above
Launch the in.melt example from lammps/examples/melt in serial mode
[testuser@master ~]$ cd lammps/examples/melt
[testuser@master melt]$ lmp -in in.melt
LAMMPS (10 Feb 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
  using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (16.795962 16.795962 16.795962)
  1 by 1 by 1 MPI processor grid
Created 4000 atoms
  create_atoms CPU = 0.001 seconds
Neighbor list info ...
  update every 20 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 1.4, bins = 12 12 12
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut, perpetual
      attributes: half, newton on
      pair build: half/bin/atomonly/newton
      stencil: half/bin/3d/newton
      bin: standard
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 3.222 | 3.222 | 3.222 Mbytes
Step Temp E_pair E_mol TotEng Press
       0            3   -6.7733681            0   -2.2744931   -3.7033504
      50    1.6758903   -4.7955425            0   -2.2823355     5.670064
     100    1.6458363   -4.7492704            0   -2.2811332    5.8691042
     150    1.6324555   -4.7286791            0    -2.280608    5.9589514
     200    1.6630725   -4.7750988            0   -2.2811136    5.7364886
     250    1.6275257   -4.7224992            0    -2.281821    5.9567365
Loop time of 0.959769 on 1 procs for 250 steps with 4000 atoms

Performance: 112527.020 tau/day, 260.479 timesteps/s
100.0% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.82686    | 0.82686    | 0.82686    |   0.0 | 86.15
Neigh   | 0.098545   | 0.098545   | 0.098545   |   0.0 | 10.27
Comm    | 0.014394   | 0.014394   | 0.014394   |   0.0 |  1.50
Output  | 0.00025191 | 0.00025191 | 0.00025191 |   0.0 |  0.03
Modify  | 0.016137   | 0.016137   | 0.016137   |   0.0 |  1.68
Other   |            | 0.003585   |            |       |  0.37

Nlocal:  4000.00 ave 4000 max 4000 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:  5499.00 ave 5499 max 5499 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:  151513.0 ave 151513 max 151513 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 151513
Ave neighs/atom = 37.878250
Neighbor list builds = 12
Dangerous builds not checked
Total wall time: 0:00:00
Try running LAMMPS with MPI
[testuser@master melt]$ mpirun -np 4 lmp -in in.melt
LAMMPS (10 Feb 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:94)
  using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 1.6795962 1.6795962 1.6795962
Created orthogonal box = (0.0000000 0.0000000 0.0000000) to (16.795962 16.795962 16.795962)
  1 by 2 by 2 MPI processor grid
Created 4000 atoms
  create_atoms CPU = 0.001 seconds
Neighbor list info ...
  update every 20 steps, delay 0 steps, check no
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 2.8
  ghost atom cutoff = 2.8
  binsize = 1.4, bins = 12 12 12
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair lj/cut, perpetual
      attributes: half, newton on
      pair build: half/bin/atomonly/newton
      stencil: half/bin/3d/newton
      bin: standard
Setting up Verlet run ...
  Unit style    : lj
  Current step  : 0
  Time step     : 0.005
Per MPI rank memory allocation (min/avg/max) = 2.706 | 2.706 | 2.706 Mbytes
Step Temp E_pair E_mol TotEng Press
       0            3   -6.7733681            0   -2.2744931   -3.7033504
      50    1.6754119   -4.7947589            0   -2.2822693    5.6615925
     100    1.6503357    -4.756014            0   -2.2811293    5.8050524
     150    1.6596605   -4.7699432            0   -2.2810749    5.7830138
     200    1.6371874   -4.7365462            0   -2.2813789    5.9246674
     250    1.6323462   -4.7292021            0   -2.2812949    5.9762238
Loop time of 0.295548 on 4 procs for 250 steps with 4000 atoms

Performance: 365422.843 tau/day, 845.886 timesteps/s
95.5% CPU use with 4 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 0.21616    | 0.22506    | 0.23363    |   1.5 | 76.15
Neigh   | 0.02635    | 0.02702    | 0.027748   |   0.4 |  9.14
Comm    | 0.026872   | 0.036185   | 0.045796   |   4.2 | 12.24
Output  | 0.00010878 | 0.00012555 | 0.00017277 |   0.0 |  0.04
Modify  | 0.0042559  | 0.0043301  | 0.0044012  |   0.1 |  1.47
Other   |            | 0.002825   |            |       |  0.96

Nlocal:  1000.00 ave 1010 max 982 min
Histogram: 1 0 0 0 0 0 1 0 0 2
Nghost:  2703.75 ave 2713 max 2689 min
Histogram: 1 0 0 0 0 0 0 2 0 1
Neighs:  37915.5 ave 39239 max 36193 min
Histogram: 1 0 0 0 0 1 1 0 0 1

Total # of neighbors = 151662
Ave neighs/atom = 37.915500
Neighbor list builds = 12
Dangerous builds not checked
Total wall time: 0:00:00
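To compare the serial and MPI runs, the "Loop time" lines can be turned into a speedup and a parallel efficiency. The following is a minimal sketch using awk. The function names loop_time and speedup are hypothetical helpers, and the numbers are the loop times from the two runs shown above.

```shell
# Extract the loop time (in seconds) from a LAMMPS log read on stdin.
# Relies on the "Loop time of <t> on <n> procs ..." line format seen above.
loop_time() {
    awk '/^Loop time of/ { print $4 }'
}

# Print speedup and parallel efficiency.
# Arguments: serial loop time, parallel loop time, number of MPI ranks.
speedup() {
    awk -v t1="$1" -v tp="$2" -v p="$3" \
        'BEGIN { s = t1 / tp; printf "speedup %.2f, efficiency %.1f%%\n", s, 100 * s / p }'
}

# Loop times from the serial run and the 4-rank run.
speedup 0.959769 0.295548 4
```

For these two runs this prints "speedup 3.25, efficiency 81.2%". With a log saved to a file, the time could be extracted with something like loop_time < log.lammps.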
7.3. Running LAMMPS on our cluster
We will now launch a series of LAMMPS jobs on our cluster to test it. The LAMMPS input script we will use is in.lj, found in the lammps/bench/ folder. It’s a simple benchmark with a grid of 32000 atoms. You can adjust the size of the simulation by passing the variables x, y and z to the input script. Each one multiplies the grid in its dimension. E.g., with x=2,y=2,z=2 the grid doubles in size in each dimension and has 8 times the number of atoms (256000). You can use this benchmark to look at strong and weak scaling.
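The atom count follows directly from the multipliers: 32000 times x times y times z. A small sketch of that arithmetic, useful when picking problem sizes for scaling tests (the function name lj_atoms is a hypothetical helper, not part of LAMMPS):

```shell
# Number of atoms in the in.lj benchmark for given x, y, z multipliers.
# The base grid (x=y=z=1) has 32000 atoms.
lj_atoms() {
    echo $(( 32000 * $1 * $2 * $3 ))
}

lj_atoms 1 1 1   # 32000
lj_atoms 2 2 1   # 128000
lj_atoms 2 2 2   # 256000
```

For a weak-scaling test you would grow the multipliers together with the number of MPI ranks so that the atoms per rank stay roughly constant; for strong scaling you keep the multipliers fixed and only change the rank count.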
Here is a Slurm job script for running a LAMMPS benchmark on a single node
Note
Notice that we don’t need to specify the number of processors to mpirun. Due to the integration with Slurm, mpirun will know how many MPI ranks to launch and where.
#!/bin/bash
#SBATCH -J lammps_test
#SBATCH -N 1
#SBATCH -n 12
#SBATCH -c 1
#SBATCH -t 10
#SBATCH -o %x-%j.out
export LAMMPS_DIR=$HOME/opt/lammps
export PATH=$LAMMPS_DIR/bin:$PATH
export LD_LIBRARY_PATH=$LAMMPS_DIR/lib:$LD_LIBRARY_PATH
export LAMMPS_POTENTIALS=$LAMMPS_DIR/share/lammps/potentials
cd $HOME/lammps/bench
# run with 32,000 atoms
mpirun lmp -in in.lj -v x 1 -v y 1 -v z 1
# run with 256,000 atoms
# mpirun lmp -in in.lj -v x 2 -v y 2 -v z 2
You can easily modify it to run on multiple nodes by changing the lines
#SBATCH -N 1
#SBATCH -n 12
Run some tests that vary both N (number of nodes) and n (number of tasks), and increase the number of atoms by changing the x, y and z parameters. Look at how the Loop time changes to observe how LAMMPS scales.
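Instead of editing the script for every run, the node and task counts can also be given on the sbatch command line, where they take precedence over the #SBATCH lines in the script. The sketch below only prints the commands as a dry run. It assumes the job script was saved as lammps_test.sh (a hypothetical file name) and that each node runs 12 tasks, as in the single-node script; adjust the node counts to what your cluster actually has.

```shell
JOB_SCRIPT=lammps_test.sh   # hypothetical file name for the job script above
TASKS_PER_NODE=12           # assumption: same ratio as -N 1 / -n 12 in the script

# Print one sbatch command per requested node count (dry run).
# Remove the 'echo' to actually submit the jobs.
scaling_commands() {
    for nodes in "$@"; do
        echo sbatch -N "$nodes" -n "$(( nodes * TASKS_PER_NODE ))" "$JOB_SCRIPT"
    done
}

scaling_commands 1 2 4
```

Each job writes its own output file because of the %x-%j.out pattern in the script, so the Loop time of every run can be compared afterwards.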
Note
If you want to test the effect InfiniBand (IB) has on the simulation time, you could try running without IB. To do this, pass the following additional parameters to mpirun:
mpirun --mca btl '^openib' lmp -in in.lj -v x 1 -v y 1 -v z 1