Overview and Architecture

Slurm has a centralized manager, slurmctld, that monitors resources and work. There may also be a backup manager that assumes those responsibilities if the primary fails (not used in this exercise). Each compute node runs a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work.
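The wait, execute, report, repeat cycle described for slurmd can be illustrated with a small toy model. This is only a sketch of the idea and not how slurmd is implemented: the names node_daemon, report, and demo are hypothetical, work items are plain Python callables, and a None entry is an assumed shutdown signal.

```python
import queue


def node_daemon(work_queue, report):
    """Toy stand-in for a compute-node daemon (not real slurmd code)."""
    while True:
        task = work_queue.get()      # wait for work from the manager
        if task is None:             # assumed shutdown sentinel
            break
        try:
            task()                   # execute the work
            status = 0
        except Exception:
            status = 1               # any failure is reported as non-zero
        report(task.__name__, status)  # return status, then loop for more work


def demo():
    """Queue two tasks, run the loop, and return the reported statuses."""
    def good_task():
        pass

    def failing_task():
        raise RuntimeError("simulated failure")

    work_queue = queue.Queue()
    for item in (good_task, failing_task, None):
        work_queue.put(item)

    reports = []
    node_daemon(work_queue, lambda name, status: reports.append((name, status)))
    return reports
```

In this model the list returned by demo() plays the role of the status information the manager receives back from a node.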

The slurmd daemons provide fault-tolerant hierarchical communications. There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record accounting information for multiple Slurm-managed clusters in a single database. User tools include srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs, and sacct to get information about jobs and job steps that are running or have completed.
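As a quick reference, the user tools listed above can be mapped to typical invocations. The sketch below only builds the command lines and runs one if the tool is installed. The helper names build_command and run_if_available and the action keywords are hypothetical. The options shown (squeue -u for a user, sacct -j for a job ID) are common ones, but check the man pages for the Slurm version you install.

```python
import shutil
import subprocess


def build_command(action, target=None):
    """Return an argv list for a Slurm user tool (hypothetical helper)."""
    if action == "status":        # sinfo: report system status
        return ["sinfo"]
    if action == "queue":         # squeue: report job status, optionally per user
        return ["squeue"] + (["-u", str(target)] if target else [])
    if action == "cancel":        # scancel: terminate a queued or running job
        return ["scancel", str(target)]
    if action == "accounting":    # sacct: information about a job and its steps
        return ["sacct", "-j", str(target)]
    if action == "run":           # srun: initiate a job running the given command
        return ["srun"] + list(target)
    raise ValueError(f"unknown action: {action}")


def run_if_available(argv):
    """Run the command and return its output, or None if the tool is missing."""
    if shutil.which(argv[0]) is None:
        return None
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout
```

For example, build_command("cancel", 42) yields the arguments for scancel 42, and run_if_available(build_command("status")) returns None on a machine where Slurm has not been installed yet.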

The sview command graphically reports system and job status, including network topology. The administrative tool scontrol is available to monitor and/or modify configuration and state information on the cluster. The administrative tool used to manage the database is sacctmgr; it can be used to identify the clusters, valid users, valid bank accounts, etc.

Slurm Components. (taken from https://slurm.schedmd.com/overview.html)

In this exercise we will install and configure the slurmctld and slurmd components and optionally configure slurmdbd to manage user accounts.