Using Slurm Workload Manager

This page is under development!

Our batch system is migrating from Torque and Maui to Slurm.

To run an application, users write a job script and submit it with the sbatch command. Each submission is uniquely identified by its job ID.

Each job requests a total walltime as well as a number of processors. Using this information, the scheduler decides when to allocate resources for your job and run it on the batch system.


Queue Limits

Currently jobs have the following limitations:

CB2RR Cluster
  • Normal queue
    • maximum number of nodes: 15
    • maximum wall time: 120 hours (5 days)
    • maximum number of jobs: unlimited
  • GPU queue
    • maximum number of nodes: 3
    • maximum wall time: 120 hours (5 days)
    • maximum number of jobs: unlimited
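
The partitions and their limits can also be checked directly on the cluster with the standard sinfo command (the output below is illustrative; node names, counts, and states will vary):

[tuXXXXXX@cb2rr ~]$ sinfo
  PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
  normal*      up 5-00:00:00     15   idle c[001-015]
  gpu          up 5-00:00:00      3   idle gpu[001-003]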

Job scripts

A job script is simply a shell script with a special comment header. These header additions allow you to specify parameters for your job, such as the resources you need.

The following example illustrates a job script which requests a single processor on a single node and executes a serial program on it.

#!/bin/sh
#SBATCH --time=01:00:00
#SBATCH --job-name=mytestjob
#SBATCH --ntasks=1 --nodes=1
#SBATCH --partition=normal
#SBATCH --output=mytestjob-%j.out

# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR

# run my program
./myexecutable

In the example above, the lines beginning with #SBATCH set job scheduler options:

#SBATCH --time=01:00:00

Sets the maximum wallclock time the job is allowed to run; in this case, 1 hour.
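
Slurm also accepts a days-hours:minutes:seconds form for the time limit; for example, the following requests 2 days and 12 hours:

#SBATCH --time=2-12:00:00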

#SBATCH --job-name=mytestjob

Sets the job name as shown in the output of the squeue command.

#SBATCH --ntasks=1 --nodes=1

Specifies the requested number of tasks and nodes.

#SBATCH --partition=normal

Specifies the partition (queue) in which the job will run.

#SBATCH --output=mytestjob-%j.out

Specifies the file for the job's output log. Here %j is replaced by the job ID.

$SLURM_SUBMIT_DIR

The directory from which the sbatch command was issued.
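
Assuming the example script above is saved as mytestjob.sh, a typical submit-and-check session looks like this (the job ID, node, and timings shown are illustrative):

[tuXXXXXX@cb2rr test_slurm]$ sbatch mytestjob.sh
Submitted batch job 210
[tuXXXXXX@cb2rr test_slurm]$ squeue -u $USER
  JOBID PARTITION    NAME        USER      ST      TIME  NODES NODELIST(REASON)
  210    normal      mytestjob   tuXXXXXX  R       0:01      1 c001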

Controlling Email notifications

Two options can be added to your job scripts to control when and where the batch system sends email notifications about jobs.

#SBATCH --mail-type=BEGIN

Tells the batch system to send email when the job begins to run. Other values include NONE, END, FAIL, REQUEUE, and ALL.

#SBATCH --mail-user=testuser@temple.edu

Specifies the email address to which notifications are sent.
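
For example, to be notified when the job starts and again when it ends or fails, the two options can be combined in the job script header (the address below is a placeholder):

#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=testuser@temple.edu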


Job Control and Monitoring

sbatch

Submit a job to the batch system

sbatch job_script

scancel

The scancel command removes the job specified by JOBID from the queue, or terminates it if it is already running.

scancel JOBID
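
scancel can also select jobs by attribute instead of a single job ID; for example, using standard scancel options:

scancel -u $USER            # cancel all of your own jobs
scancel --name=mytestjob    # cancel your jobs with a given job name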

squeue

The squeue command displays information about the job queue:

squeue
  JOBID PARTITION    NAME        USER      ST      TIME  NODES NODELIST(REASON)
  203    normal      mytestjob-1 tuXXXXXX  R       0:05      8 c[003-010]
  204    normal      mytestjob-2 tuXXXXXX  R       0:02      8 c[011-018]
  205    normal      mytestjob-3 tuXXXXXX  R       0:02      8 c[020-027]
  206    normal      mytestjob-4 tuXXXXXX  R       0:02      8 c[028-035]
  207    normal      mytestjob-5 tuXXXXXX  R       0:02      8 c[036-043]
  208    normal      mytestjob-6 tuXXXXXX  R       0:02      8 c[044-049,059-060]

Jobs marked R are running; jobs marked PD are pending (queued or on hold).
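
Two commonly useful variations (both standard squeue options) are listing only your own jobs and asking for the scheduler's estimated start times of pending jobs:

squeue -u $USER     # show only your jobs
squeue --start      # show estimated start times for pending jobs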

Checking Jobs

If a job behaves strangely, or simply to see in more detail how the scheduler views the job, you can inspect it with the scontrol show job <jobid> command.


[tuXXXXXX@cb2rr test_slurm]$ scontrol show job 209
  JobId=209 JobName=mytestjob-7
  UserId=tuXXXXXX(XXXX) GroupId=XXX(XXXXX) MCS_label=N/A
  Priority=4294901554 Nice=0 Account=(null) QOS=(null)
  JobState=PENDING Reason=Resources Dependency=(null)
  Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
  RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
  SubmitTime=2023-08-01T13:36:07 EligibleTime=2023-08-01T13:36:07
  AccrueTime=2023-08-01T13:36:07
  StartTime=2023-08-01T17:34:02 EndTime=2023-08-01T21:34:02 Deadline=N/A
  SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-01T13:37:54 Scheduler=Backfill:*
  Partition=normal AllocNode:Sid=cb2rr:3256738
  ReqNodeList=(null) ExcNodeList=(null)
  NodeList= SchedNodeList=c[003-010]
  NumNodes=8-8 NumCPUs=160 NumTasks=160 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
  TRES=cpu=160,mem=1000000M,node=8,billing=160
  Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
  MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
  Features=(null) DelayBoot=00:00:00
  OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
  Command=/home/tuXXXXXX/test_slurm/test.sh
  WorkDir=/home/tuXXXXXX/test_slurm
  StdErr=/home/tuXXXXXX/test_slurm/slurm-209.out
  StdIn=/dev/null
  StdOut=/home/tuXXXXXX/test_slurm/slurm-209.out
  Power=

Interactive sessions

A user can request an interactive shell session on a compute node from the job scheduler. For example, an interactive session with a single processor on a single node can be requested as follows:

srun -N 1 --partition normal --pty bash -i

The srun command will not return until a node with the specified resources becomes available. Once the resources are available, a shell prompt on the allocated node is presented to the user.

[tuXXXXXX@cb2rr test_slurm]$ srun -N 1 --partition normal --pty bash -i
srun: job 215 queued and waiting for resources

===================================================
Begin TASK Prologue Tue Aug  1 01:44:38 PM EDT 2023
===================================================
Job ID:           215
Username:         tuXXXXXX
Group:            xxx
Job Name:         bash
Resources List:   nodes=1:ppn=1:ntasks=1
Queue:            normal
Nodes:      c001
===================================================
End TASK Prologue Tue Aug  1 01:44:38 PM EDT 2023
===================================================
[tuXXXXXX@c001 ~]$ echo Hello World!
Hello World!
[tuXXXXXX@c001 ~]$ exit
exit
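
More resources can be requested for an interactive session in the same way as for batch jobs; for example, a sketch requesting 4 tasks on a single node in the normal partition:

srun -N 1 -n 4 --partition normal --pty bash -i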

Job Script Examples

MPI jobs using srun
#!/bin/bash
#======================================================
#
# Job script for running a parallel job on multiple 
# cores across multiple nodes
#
#======================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the normal partition (queue)
#SBATCH --partition=normal
#
# No. of nodes (see queue limits above)
#SBATCH --nodes=2
#
# No. of tasks (CPUs) required (see queue limits above)
#SBATCH --ntasks=40
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=mpi_test
#
# Output file
#SBATCH --output=mpi_test-%j.out
#======================================================

# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR

# Load Modules
module load mpi/openmpi

# Modify the line below to run your program
srun -n $SLURM_NTASKS  ./my_mpi_application.x

MPI jobs using mpirun
#!/bin/bash
#======================================================
#
# Job script for running a parallel job on multiple 
# cores across multiple nodes
#
#======================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the normal partition (queue)
#SBATCH --partition=normal
#
# No. of nodes (see queue limits above)
#SBATCH --nodes=2
#
# No. of tasks (CPUs) required (see queue limits above)
#SBATCH --ntasks=40
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=mpi_test
#
# Output file
#SBATCH --output=mpi_test-%j.out
#======================================================

# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR

# Load Modules
module load mpi/openmpi

# Modify the line below to run your program
mpirun -np $SLURM_NTASKS  ./my_mpi_application.x

GPU jobs
#!/bin/bash
#======================================================
#
# Job script for running a parallel job on GPUs
#
#======================================================

#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the gpu partition (queue)
#SBATCH --partition=gpu
#
# Total number GPUs for the job
#SBATCH --gpus=2
#
# Number of GPUs to use per node (max 2)
#SBATCH --gpus-per-node=2
#
# Number of CPUs per GPU
#SBATCH --cpus-per-gpu=1
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=gpu_test
#
# Output file
#SBATCH --output=gpu_test-%j.out
#======================================================

# Load CUDA always
module load cuda

# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR

srun --gpus 1 ./my_gpu_application.x
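
To confirm which GPUs are visible inside the allocation before launching the application, a quick check can be added to the script (assuming the nvidia-smi utility is available on the GPU nodes):

# list the GPUs visible to this job step (assumes nvidia-smi is installed)
srun --gpus 1 nvidia-smi -L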