Using the Batch System

Compute supports running applications through a batch system. This batch system is responsible for launching applications on the batch nodes (c01 to c08). Nodes are shared between users: if one user requests only a few processors, other jobs can still run on the remaining resources.

Our batch system consists of the Torque resource manager and the Maui scheduler. Torque keeps track of the state of all compute nodes and controls each job execution. The scheduler determines when and where jobs are run. Its overall goal is to maximize utilization of the resources and give all users a fair chance of running their jobs.

To run an application, users write a job script and submit it with the qsub command. Each submission is uniquely identified by its job ID.

Each job can request a total walltime, as well as a number of processors. Using this information, the scheduler decides when to allocate resources for your job and run it on the batch system.


Queue Limits

Currently Compute jobs have the following limitations:

  • one node per job
  • maximum wall time of 168 hours (7 days)
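
Expressed as job script directives (explained in the next section), a request at these limits might look like the following sketch; the usable ppn value depends on the node's core count, which is not specified here:

```shell
# Largest single-job request permitted by the current limits:
# one node, up to 168 hours (7 days) of walltime.
# ppn=1 shown here; raise it up to the node's core count as needed.
#PBS -l nodes=1:ppn=1
#PBS -l walltime=168:00:00
```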

Job scripts

A job script is simply a shell script with a special comment header. These header lines let you specify parameters for your job, such as the resources it needs.

The following example illustrates a job script which requests a single processor on a single node and executes a serial program on it.

#!/bin/sh
#PBS -l walltime=1:00:00
#PBS -N mytestjob
#PBS -l nodes=1:ppn=1

# change to directory where 'qsub' was called
cd $PBS_O_WORKDIR

# run my program
./myexecutable

In the example above, the lines beginning with #PBS set job scheduler options:

#PBS -l walltime=1:00:00

Sets the maximum wallclock time the job is allowed to run; in this case, 1 hour.

#PBS -N mytestjob

Sets the job name as shown in the output of the qstat command.

#PBS -l nodes=1:ppn=1

Specifies the requested number of nodes and processors per node.

$PBS_O_WORKDIR

The working directory from which the qsub command was issued.
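
For a multi-processor job on a single node, the same directives apply with a larger ppn value. The following is a sketch for a threaded (e.g. OpenMP) program; the executable name mythreadedapp and the value ppn=8 are assumptions for illustration:

```shell
#!/bin/sh
#PBS -l walltime=4:00:00
#PBS -N mythreadedjob
#PBS -l nodes=1:ppn=8

# change to directory where 'qsub' was called
cd $PBS_O_WORKDIR

# match the thread count to the number of processors requested above
export OMP_NUM_THREADS=8
./mythreadedapp
```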

Controlling Email notifications

Two options can be added to your job scripts to control when and where the batch system sends email notifications about jobs.

#PBS -m bae

Tells the batch system to send email when the job begins to run (b), aborts (a), or ends (e). Remove a letter to suppress that particular message.

#PBS -M testuser@temple.edu

Specifies the address the notifications are sent to.
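
Combined in a job script header, the two options look like this (the address is the same placeholder used above):

```shell
# Email testuser@temple.edu when the job begins, aborts, or ends:
#PBS -m bae
#PBS -M testuser@temple.edu
```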


Job Control and Monitoring

qsub

Submit a job to the batch system

qsub job_script
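
Resource options can also be given on the qsub command line, where they override the corresponding #PBS directives in the script. For example, to resubmit the same script with a longer walltime:

```shell
# Command-line options override the matching #PBS lines in the script.
qsub -l walltime=4:00:00 job_script
```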

qdel

The qdel command removes the job specified by JOBID from the queue, or terminates it if it is already running.

qdel JOBID

qstat

The qstat command displays the current queue of jobs:

qstat
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
101                        testJob1         tuXXXXXX        1:55: R compute
102                        testJob2         tuXXXXXX        2:48: R compute
103                        testJob3         tuXXXXXX        3:36: R compute
104                        testJob4         tuXXXXXX        1:52: R compute
105                        testJob5         tuXXXXXX            0 Q compute
106                        testJob6         tuXXXXXX            0 Q compute
107                        testJob7         tuXXXXXX            0 Q compute
108                        testJob8         tuXXXXXX            0 H compute
109                        testJob9         tuXXXXXX            0 H compute
110                        testJob10        tuXXXXXX            0 H compute

All jobs marked with R are running, Q means the job is queued, and H that the job is on hold.
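
Torque's qstat also accepts further options; two commonly useful ones are -u, to show only a given user's jobs, and -f, for the full status of a single job:

```shell
# Only your own jobs (replace tuXXXXXX with your username):
qstat -u tuXXXXXX

# Full details for one job:
qstat -f 105
```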

showq

This command shows you the queue as seen by the Maui scheduler.

showq
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

186             tuXXXXXX    Running    12    00:08:26  Fri Mar  1 06:17:12
187             tuXXXXXX    Running    12    00:08:46  Fri Mar  1 06:17:32
188             tuXXXXXX    Running     8    00:10:06  Fri Mar  1 10:20:54
189             tuXXXXXX    Running     8    00:10:46  Fri Mar  1 10:21:22
190             tuXXXXXX    Running     4    00:14:46  Fri Mar  1 09:17:32
195             tuXXXXXX    Running     4    00:14:30  Fri Mar  1 10:34:24
201             tuXXXXXX    Running    24    00:16:46  Fri Mar  1 14:24:05
220             tuXXXXXX    Running    24    00:16:46  Fri Mar  1 14:48:23

    8 Active Jobs    96 of 96 Processors Active (100.00%)
                       2 of  2 Nodes Active      (100.00%)

IDLE JOBS----------------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

202                tuXXXXXX       Idle     4  1:00:40:00  Fri Mar  1 15:00:48
205                tuXXXXXX       Idle     4     2:15:00  Fri Mar  1 13:00:49
209                tuXXXXXX       Idle     4     2:15:00  Fri Mar  1 13:01:47

3 Idle Jobs

BLOCKED JOBS----------------
JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

286                tuXXXXXX       Idle    48  2:00:00:00  Tue Feb 26 15:02:13

Total Jobs: 12   Active Jobs: 8   Idle Jobs: 3   Blocked Jobs: 1
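
The WCLIMIT and REMAINING columns use Maui's [DD:]HH:MM:SS notation; for example, 1:00:40:00 above is 1 day and 40 minutes. As a small local illustration of reading these values — this helper is not part of Torque or Maui and requires bash:

```shell
# Convert a [DD:]HH:MM:SS walltime string, as printed by showq, to seconds.
# Local illustration only, not a batch system command.
walltime_to_seconds() {
    local IFS=':'
    # split the argument on ':' into positional parameters
    set -- $1
    if [ $# -eq 4 ]; then
        # DD:HH:MM:SS (10# forces base 10 so leading zeros are not read as octal)
        echo $(( 10#$1 * 86400 + 10#$2 * 3600 + 10#$3 * 60 + 10#$4 ))
    else
        # HH:MM:SS
        echo $(( 10#$1 * 3600 + 10#$2 * 60 + 10#$3 ))
    fi
}

walltime_to_seconds 2:15:00      # 2 hours 15 minutes -> 8100
walltime_to_seconds 1:00:40:00   # 1 day 40 minutes   -> 88800
```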

checkjob

If a job behaves strangely, or you simply want more detail on how the scheduler views it, you can inspect each job with the checkjob command.


checkjob 286

checking job 286

State: Idle
Creds:  user:tuXXXXXX  group:phys  class:compute  qos:computeqos
WallTime: 00:00:00 of 2:00:00:00
SubmitTime: Tue Feb 26 15:02:13
  (Time Queued  Total: 3:00:16:01  Eligible: 1:19:35:04)

Total Tasks: 48

Req[0]  TaskCount: 48  Partition: ALL
Network: [NONE]  Memory >= 0  Disk >= 0  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [broadwell]


IWD: [NONE]  Executable:  [NONE]
Bypass: 5  StartCount: 0
PartitionMask: [ALL]
Flags:       RESTARTABLE DEDICATEDNODE

PE:  48.00  StartPriority:  2607
cannot select job 286 for partition DEFAULT (job 286 violates active HARD MAXPS limit of 400000 for user tuXXXXXX  (R: 8294400, U: 37167928)
)


Interactive sessions

A user can submit a request to the job scheduler for an interactive shell session on a compute node. For example, an interactive session with a single processor on one node can be requested as follows:

qsub -I -l nodes=1:ppn=1

The qsub command will not return until a node with the specified resources becomes available. Once the resources are available, a shell prompt on the allocated node is presented to the user.

[tuXXXXXX@compute ~]$ qsub -I -l nodes=1:ppn=1
qsub: waiting for job 1 to start
qsub: job 1 ready

----------------------------------------
Begin Batch Job Prologue Fri Mar  1 14:44:50 EST 2019
Job ID:           1
Username:         tuXXXXXX
Group:            hpc
Job Name:         STDIN
Resources List:   nodes=1:ppn=1,neednodes=1:ppn=1,walltime=00:30:00
Queue:            compute
Account:          
Nodes:            compute3 
----------------------------------------
End Batch Job Prologue Fri Mar  1 14:44:51 EST 2019
----------------------------------------
[tuXXXXXX@compute3 ~]$ echo Hello World!
Hello World!
[tuXXXXXX@compute3 ~]$ exit
logout

qsub: job 1 completed

Without further qsub parameters, the job is terminated after the default walltime of 30 minutes, or when the user exits the session.
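
Further resources can be requested for an interactive session in the same way as for batch jobs; for example, four processors and two hours of walltime:

```shell
qsub -I -l nodes=1:ppn=4,walltime=2:00:00
```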