Using Slurm Workload Manager
Our batch system is migrating from Torque and Maui to Slurm.
To run an application, users write a job script and submit it with the
sbatch
command. Each submission is uniquely identified by
its job ID.
Each job can request a total walltime as well as a number of processors. Using this information, the scheduler decides when to allocate resources for your job and run it on the batch system.
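As a minimal sketch of that workflow (the script name myjob.sh is a placeholder, and the printed job ID will of course differ):

```shell
# Submit a job script; sbatch replies with the new job's ID,
# e.g. "Submitted batch job 12345"
sbatch myjob.sh

# That job ID identifies the job in all later commands:
squeue -j 12345     # show the job's queue status
scancel 12345       # cancel the job
```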
Queue Limits
Currently jobs have the following limitations:
CB2RR Cluster
- Normal queue
  - maximum number of nodes: 15
  - maximum wall time: 120 hours (5 days)
  - maximum number of jobs: unlimited
- GPU queue
  - maximum number of nodes: 3
  - maximum wall time: 120 hours (5 days)
  - maximum number of jobs: unlimited
Job scripts
A job script is simply a shell script with a special comment header. These header additions allow you to specify parameters for your job, such as the resources you need.
The following example illustrates a job script which requests a single processor on a single node and executes a serial program on it.
#!/bin/sh
#SBATCH --time=01:00:00
#SBATCH --job-name=mytestjob
#SBATCH --ntasks=1 --nodes=1
#SBATCH --partition=normal
#SBATCH --output=mytestjob-%j.out
# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR
# run my program
./myexecutable
In the example above, the lines beginning with #SBATCH
set job scheduler options:
- --time=01:00:00: Sets the maximum wallclock time the job is allowed to run, in this case 1 hour.
- --job-name=mytestjob: Sets the job name as seen in the output of squeue.
- --ntasks=1 --nodes=1: Specifies the requested number of nodes and number of tasks.
- --partition=normal: Specifies the queue (partition) in which the job will run.
- --output=mytestjob-%j.out: Specifies the output file for the job's log; %j is replaced by the job ID.
- $SLURM_SUBMIT_DIR: Environment variable holding the working directory where sbatch was called.
Controlling Email notifications
Two options can be added to your job scripts to control when and where the batch system sends email notifications about jobs:
- --mail-type=BEGIN: Tells the batch system to send email when the job begins to run. Other valid values include NONE, BEGIN, END, FAIL, REQUEUE, and ALL.
- --mail-user=<address>: Specifies the email address to which notifications are sent.
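For example, a job script header that notifies the submitter when the job finishes or fails might look like this (the address is a placeholder):

```shell
#!/bin/sh
#SBATCH --time=01:00:00
#SBATCH --job-name=mytestjob
#SBATCH --mail-type=END,FAIL          # notify on normal end and on failure
#SBATCH --mail-user=you@example.edu   # placeholder address
./myexecutable
```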
Job Control and Monitoring
sbatch
Submit a job to the batch system
sbatch job_script
scancel
The scancel
command removes the job specified by
JOBID
from the queue, or terminates the job if it is already executing.
scancel JOBID
squeue
The squeue
command displays information about the queue of jobs:
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
203 normal mytestjob-1 tuXXXXXX R 0:05 8 c[003-010]
204 normal mytestjob-2 tuXXXXXX R 0:02 8 c[011-018]
205 normal mytestjob-3 tuXXXXXX R 0:02 8 c[020-027]
206 normal mytestjob-4 tuXXXXXX R 0:02 8 c[028-035]
207 normal mytestjob-5 tuXXXXXX R 0:02 8 c[036-043]
208 normal mytestjob-6 tuXXXXXX R 0:02 8 c[044-049,059-060]
Jobs marked with R are running; PD means the job is pending (queued or on hold).
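On a busy cluster it is usually more useful to list only your own jobs, or a single job; the standard -u and -j flags do this:

```shell
# Show only jobs belonging to the current user
squeue -u $USER

# Show one specific job by its ID
squeue -j 203
```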
Checking Jobs
If a job behaves strangely, or simply to see in more detail how the scheduler
views the job, you can inspect it using
the scontrol show job JOBID
command.
[tuXXXXXX@cb2rr test_slurm]$ scontrol show job 209
JobId=209 JobName=mytestjob-7
UserId=tuXXXXXX(XXXX) GroupId=XXX(XXXXX) MCS_label=N/A
Priority=4294901554 Nice=0 Account=(null) QOS=(null)
JobState=PENDING Reason=Resources Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:00 TimeLimit=04:00:00 TimeMin=N/A
SubmitTime=2023-08-01T13:36:07 EligibleTime=2023-08-01T13:36:07
AccrueTime=2023-08-01T13:36:07
StartTime=2023-08-01T17:34:02 EndTime=2023-08-01T21:34:02 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-08-01T13:37:54 Scheduler=Backfill:*
Partition=normal AllocNode:Sid=cb2rr:3256738
ReqNodeList=(null) ExcNodeList=(null)
NodeList= SchedNodeList=c[003-010]
NumNodes=8-8 NumCPUs=160 NumTasks=160 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=160,mem=1000000M,node=8,billing=160
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/tuk02575/test_slurm/test.sh
WorkDir=/home/tuk02575/test_slurm
StdErr=/home/tuk02575/test_slurm/slurm-209.out
StdIn=/dev/null
StdOut=/home/tuk02575/test_slurm/slurm-209.out
Power=
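The full record is long; when only one or two fields matter, the output can be filtered with grep. For instance, the line carrying the job state and the reason a pending job is still waiting:

```shell
# Extract only the state/reason line from the job record
scontrol show job 209 | grep JobState
```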
Interactive sessions
A user can submit a request to the job scheduler for an interactive shell session on a compute node. For example, an interactive session with a single processor on a single node can be requested as follows:
srun -N 1 --partition normal --pty bash -i
The srun
command will not return until a node with the
specified resources becomes available. Once the resources are available, a
shell prompt on the allocated node is presented to the user.
[tuXXXXXX@cb2rr test_slurm]$ srun -N 1 --partition normal --pty bash -i
srun: job 215 queued and waiting for resources
===================================================
Begin TASK Prologue Tue Aug 1 01:44:38 PM EDT 2023
===================================================
Job ID: 215
Username: tuXXXXXX
Group: xxx
Job Name: bash
Resources List: nodes=1:ppn=1:ntasks=1
Queue: normal
Nodes: c001
===================================================
End TASK Prologue Tue Aug 1 01:44:38 PM EDT 2023
===================================================
[tuXXXXXX@c001 ~]$ echo Hello World!
Hello World!
[tuXXXXXX@c001 ~]$ exit
exit
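The same mechanism works for larger interactive allocations; for example, four tasks on one node with a one-hour limit (the values here are illustrative):

```shell
# Interactive shell with 4 tasks on a single node, 1 hour time limit
srun -N 1 -n 4 --time=01:00:00 --partition normal --pty bash -i
```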
Job Script Examples
MPI jobs using srun
#!/bin/bash
#======================================================
#
# Job script for running a parallel job on multiple
# cores across multiple nodes
#
#======================================================
#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the normal partition (queue)
#SBATCH --partition=normal
#
# No. of nodes (see queue limits above)
#SBATCH --nodes=2
#
# No. of tasks (CPUs) required (see queue limits above)
#SBATCH --ntasks=40
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=mpi_test
#
# Output file
#SBATCH --output=mpi_test-%j.out
#======================================================
# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR
# Load Modules
module load mpi/openmpi
# Modify the line below to run your program
srun -n $SLURM_NTASKS ./my_mpi_application.x
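Before launching a long MPI run, a quick sanity check is to let srun run hostname in place of the application; the counts show how the tasks are spread across the allocated nodes:

```shell
# Print one hostname per task; counts show the tasks-per-node split
srun -n $SLURM_NTASKS hostname | sort | uniq -c
```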
MPI jobs using mpirun
#!/bin/bash
#======================================================
#
# Job script for running a parallel job on multiple
# cores across multiple nodes
#
#======================================================
#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the normal partition (queue)
#SBATCH --partition=normal
#
# No. of nodes (see queue limits above)
#SBATCH --nodes=2
#
# No. of tasks (CPUs) required (see queue limits above)
#SBATCH --ntasks=40
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=mpi_test
#
# Output file
#SBATCH --output=mpi_test-%j.out
#======================================================
# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR
# Load Modules
module load mpi/openmpi
# Modify the line below to run your program
mpirun -np $SLURM_NTASKS ./my_mpi_application.x
GPU jobs
#!/bin/bash
#======================================================
#
# Job script for running a parallel job on GPUs
#
# Submit as follows: sbatch job_script
#
#======================================================
#======================================================
# Propagate environment variables to the compute node
#SBATCH --export=ALL
#
# Run in the gpu partition (queue)
#SBATCH --partition=gpu
#
# Total number GPUs for the job
#SBATCH --gpus=2
#
# Number of GPUs to use per node (max 2)
#SBATCH --gpus-per-node=2
#
# Number of CPUs per GPU
#SBATCH --cpus-per-gpu=1
#
# Specify (hard) runtime (HH:MM:SS)
#SBATCH --time=01:00:00
#
# Job name
#SBATCH --job-name=gpu_test
#
# Output file
#SBATCH --output=gpu_test-%j.out
#======================================================
# Load CUDA always
module load cuda
# change to directory where 'sbatch' was called
cd $SLURM_SUBMIT_DIR
srun --gpus 1 ./my_gpu_application.x
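To verify which GPUs a job step actually sees, a small sketch (this assumes Slurm sets CUDA_VISIBLE_DEVICES for GPU allocations, as it normally does, and that nvidia-smi is available on the GPU nodes):

```shell
# Show the GPU devices visible to this job step
srun --gpus 1 bash -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"; nvidia-smi -L'
```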