Wilson HPC Computing Facility

SLURM (Simple Linux Utility for Resource Management) is a powerful open-source, fault-tolerant, and highly scalable resource manager and job scheduling system, currently developed by SchedMD. Initially developed for large Linux clusters at Lawrence Livermore National Laboratory, SLURM is used extensively on many of the Top 500 supercomputers around the globe.

Contents

  1. Commands
  2. User Accounts
  3. Resource Types
  4. Using SLURM: examples
  5. SLURM Reporting
  6. Comparison between PBS/Torque and SLURM
  7. SLURM Environment Variables
  8. More Information

SLURM Commands

  • A custom queue-monitoring command, lsqueue, is available.
  • Job control and monitoring are performed by scontrol and squeue.
  • Batch jobs are submitted with sbatch.
  • Interactive job sessions are requested through salloc.
  • The command to launch a job is srun.
  • Nodes info and cluster status may be requested with sinfo.
  • Job and job steps accounting data can be accessed with sacct.
  • Useful environment variables are $SLURM_NODELIST and $SLURM_JOBID.

SLURM User Accounts

To check your default SLURM account, use the following command:

[@tev ~]$ sacctmgr list user name=johndoe
      User   Def Acct     Admin
---------- ---------- ---------
   johndoe   projectx      None

To list all the SLURM accounts you are associated with, use the following command:

[@tev ~]$ sacctmgr list user name=johndoe withassoc
      User   Def Acct     Admin    Cluster    Account  Partition     
---------- ---------- --------- ---------- ---------- ---------- 
   johndoe  projectx      None     wilson      alpha
   johndoe  projectx      None     wilson       beta
   johndoe  projectx      None     wilson      gamma

NOTE: If you do not specify an account name during your job submission (using --account), the "default" account (associated with your username) will be used to track usage.
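To charge a job to one of your other associated accounts instead of the default, add the --account directive to the batch script. A minimal sketch, using the "alpha" association from the listing above as the example account:

```shell
#!/bin/sh
# Sketch of a batch script charging usage to a non-default account.
# "alpha" is the example association from the sacctmgr listing above.
#SBATCH --job-name=acct-test
#SBATCH --account=alpha
#SBATCH --time=00:05:00

hostname
```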

SLURM Resource Types

There are currently two types of CPU resources, four types of GPU resources and one type of KNL resource that can be requested.

SLURM Type (--constraint) | Resource Type | Description | Number of resources (--nodes) | Tasks per resource (--ntasks-per-node) | Partition (--partition) | Nodenames (--nodelist) | Shared Resource **
intelcpu   | CPU   | Intel Xeon x5650 2.67GHz                         | 25 | 12  | intel12 | tev[0101-0113,0201-0211,0213]  | No
amdcpu     | CPU   | AMD Opteron 6128                                 | 32 | 32  | amd32   | tev[0301-0312,0401-0412,0501-0510] | No
k20        | GPU   | NVIDIA Kepler K20                                | 1  | 2   | gpu     | gpu1      | Yes
k40        | GPU   | NVIDIA Kepler K40                                | 1  | 2   | gpu     | gpu2      | Yes
p100nvlink | GPU   | NVIDIA P100 with NVLINK between GPUs             | 1  | 2   | gpu     | gpu3      | Yes
p100       | GPU   | NVIDIA P100 with no NVLINK between GPUs          | 1  | 8   | gpu     | gpu4      | Yes
v100nvlink | PPC64 | IBM POWER9 with NVIDIA V100, NVLINK between GPUs | 1  | 4   | ppc64   | ibmpower9 | Yes
intelknl   | KNL   | Intel Knights Landing Developers Edition         | 1  | 512 | knl     | knl1      | No
** Once assigned to a user job, this resource is by default either shared (Yes) with other jobs (if sufficient resources are available) or allocated exclusively (No) to a single user job.

Default Walltime Limits on the Wilson cluster

Most of our users do not request a particular WallTime limit in their jobs. They simply use the default value, which has typically been their MaxTime limit. In almost all cases our MaxTime limit is 24 hours. This MaxTime limit can be set for a partition, a QoS or an Account/Project.

In an effort to improve throughput and scheduling on the cluster, we are setting a default WallTime of 8 hours across most partitions. It is much easier and more efficient to backfill empty nodes with 8-hour jobs than with longer 24-hour jobs. We remind users that realistic estimates of job completion times also improve scheduling efficiency. Users are still able to submit and run 24-hour jobs, but now need to specify the WallTime their jobs require if they want something other than 8 hours. Use the '--time=24:00:00' option to set the walltime to 24 hours.
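Note that --time accepts either plain minutes or hh:mm:ss, so --time=480 and --time=8:00:00 both request the 8-hour default. A quick arithmetic check of the units:

```shell
# SLURM's --time option accepts plain minutes or hh:mm:ss.
# The 8-hour default walltime expressed in minutes:
DEFAULT_WALLTIME_MINUTES=480
DEFAULT_WALLTIME_HOURS=$(( DEFAULT_WALLTIME_MINUTES / 60 ))
echo "--time=${DEFAULT_WALLTIME_MINUTES} requests ${DEFAULT_WALLTIME_HOURS} hours"
```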

Using SLURM on the Wilson cluster: examples

  • Submit an interactive job requesting 12 Intel-based nodes
    [@tev ~]$ srun --pty --nodes=12 --ntasks-per-node=12 --partition intel12 bash
    [user@tev0101 ~]$ env | grep NTASKS
    SLURM_NTASKS_PER_NODE=12
    SLURM_NTASKS=144
    [user@tev0101 ~] exit
    

  • Submit an interactive job requesting 4 GPUs
    [@tev ~]$ srun --pty --gres=gpu:4 --partition gpu bash
    [user@gpu4 ~] nvidia-smi -L
    GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-2b1388c6-1375-0adf-55fb-869be8e4bd61)
    GPU 1: Tesla P100-PCIE-16GB (UUID: GPU-97ad1311-5fa3-9e1a-d606-d769f9de0c3e)
    GPU 2: Tesla P100-PCIE-16GB (UUID: GPU-3a7dec19-e0a3-23d8-3b97-acd1e68b9e08)
    GPU 3: Tesla P100-PCIE-16GB (UUID: GPU-d1dfed5b-9d28-85fc-a406-9fc9af6f039f)
    [user@gpu4 ~] exit
    

  • Submit an interactive job requesting a particular resource type
    [@tev ~]$ srun --pty --gres=gpu:2 --constraint=k40 --partition gpu bash
    [user@gpu2 ~]$ nvidia-smi -L
    GPU 0: Tesla K40m (UUID: GPU-c59d93c0-c2f8-605c-4eba-9a0003f41894)
    GPU 1: Tesla K40m (UUID: GPU-58caf506-83cf-b18c-20e6-6acfa44b05e7)
    [user@gpu2 ~]$ exit
    

  • Submit an interactive job requesting a particular GPU host
    [@tev ~]$ srun --pty --gres=gpu:1 --nodelist=gpu4 --partition gpu bash
    [user@gpu4 ~]$ nvidia-smi -L
    GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-2b1388c6-1375-0adf-55fb-869be8e4bd61)
    [user@gpu4 ~]$ exit
    

  • Submit an interactive job on a KNL host
    [@tev ~]$ srun --pty -N 1 --partition knl bash
    [@knl1 ~]$ 
    

  • Submit a batch job requesting 2 GPUs
    [@tev ~]$ cat myscript.sh
    #!/bin/sh
    #SBATCH --job-name=test
    #SBATCH --gres=gpu:2
    #SBATCH --nodelist=gpu3
    #SBATCH --partition gpu
    #SBATCH --qos=normal
    #SBATCH --time=480
    
    nvidia-smi -L
    sleep 5
    exit
    
    [@tev ~]$ sbatch myscript.sh
    Submitted batch job 46
    
    [@tev ~]$ lsqueue
                  JOBID PARTITION     NAME     USER ST       TIME NODES NODELIST(REASON) GRES
                     46       gpu     test   amitoj  R       0:02     1             gpu3 gpu:2
    

    NOTE: The output above shows GRES as gpu:2, indicating that 2 GPUs are in use by job ID 46. The default squeue command does not print the GRES column, so a long --format string must be passed to see it; in the future we may provide a wrapper for this command. Once the batch job completes, the output is available as follows.

    [@tev ~]$ cat slurm-46.out
    GPU 0: Tesla P100-SXM2-16GB (UUID: GPU-3dcc1d47-41af-113f-1f28-f0cb6f837885)
    GPU 1: Tesla P100-SXM2-16GB (UUID: GPU-ce7c94c8-f84f-5bab-d7d8-f7d1efc45432)
    

SLURM reporting

sreport is used to generate reports of job usage and cluster utilization for SLURM jobs saved to the SLURM Database. sreport should be run on the host tev.fnal.gov. A few worthwhile examples follow:

[@tev ~]$ sreport cluster AccountUtilizationByUser user=johndoe start=2018-03-01 -t percent
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2018-03-01T00:00:00 - 2018-04-25T23:59:59 (4834800 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name     Used   Energy
--------- --------------- --------- --------------- -------- --------
   wilson           alpha   johndoe        John Doe   56.27%    0.00%

[@tev ~]$ sreport user Top Start=04/20/18 End=04/26/18 -t percent
--------------------------------------------------------------------------------
Top 10 Users 2018-04-20T00:00:00 - 2018-04-25T23:59:59 (518400 secs)
Use reported in Percentage of Total
--------------------------------------------------------------------------------
  Cluster     Login     Proper Name         Account     Used   Energy
--------- --------- --------------- --------------- -------- --------
   wilson   johndoe        John Doe           alpha   56.27%    0.00%
   wilson   janedoe        Jane Doe          matter   30.16%    0.00%
.....
<--snipped--->

Comparison between PBS/Torque and SLURM

submit command
    PBS/Torque: qsub
    SLURM:      sbatch, srun, salloc

walltime request
    PBS/Torque: #PBS -l walltime=hh:mm:ss
    SLURM:      #SBATCH --time=hh:mm:ss  (or -t hh:mm:ss)

specific node request
    PBS/Torque: #PBS -l nodes=X:ppn=Y:gpus=2
    SLURM:      #SBATCH --nodes=X  (or -N X)
                #SBATCH --cpus-per-task=Y  (or -c Y)  [for OpenMP or hybrid code]
                #SBATCH --ntasks-per-node=Y  [to set an equal number of tasks per node, for MPI]
                #SBATCH --ntasks=Z  (or -n Z)  [to set the total number of tasks]
                #SBATCH --gres=gpu:2

define memory
    PBS/Torque: #PBS -l mem=Zgb
    SLURM:      #SBATCH --mem=Zgb

define number of procs/node
    SLURM:      #SBATCH -c <# of cpus/task>  [for OpenMP/hybrid jobs]
                #SBATCH -n <# of total tasks or processors>  [for MPI jobs]

queue request
    PBS/Torque: #PBS -q gpu
    SLURM:      #SBATCH -p gpu

group account
    PBS/Torque: #PBS -A <Account>
    SLURM:      #SBATCH -A <Account>

job name
    PBS/Torque: #PBS -N <name>
    SLURM:      #SBATCH -J <name>

output file name
    PBS/Torque: #PBS -o <filename>
    SLURM:      #SBATCH -o <name>.o%j  (where %j is the job ID)

email option
    PBS/Torque: #PBS -m e
    SLURM:      #SBATCH --mail-type=end  (options: begin, end, fail, all)

email address
    PBS/Torque: #PBS -M <email address>
    SLURM:      #SBATCH --mail-user=<email>

count processors
    PBS/Torque: NPROCS=`wc -l < $PBS_NODEFILE`
    SLURM:      NPROCS=$(( $SLURM_NNODES * $SLURM_CPUS_PER_TASK ))  # for OpenMP & hybrid (MPI + OpenMP) jobs
                NPROCS=$SLURM_NPROCS or $SLURM_NTASKS  # for MPI jobs
                Note: -N 4 -n 16 => 16 processors, NOT 4*16=64 processors
                If $PBS_NODEFILE is needed, include the following lines:
                    PBS_NODEFILE=`generate_pbs_nodefile`
                    NPROCS=`wc -l < $PBS_NODEFILE`

starting directory on the compute node
    PBS/Torque: user home directory
    SLURM:      the working (submit) directory

node names
    PBS/Torque: tevXXXX or gpuX
    SLURM:      tevXXXX or gpuX

interactive job request
    PBS/Torque: qsub -I -X
    SLURM:      srun --pty /bin/bash

reserve the node exclusively
    SLURM:      srun --exclusive --pty /bin/bash

dependency
    PBS/Torque: #PBS -d <jobid>
    SLURM:      #SBATCH -d after:<jobid>
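The processor-count recipes above can be sketched with stand-in values for the variables SLURM would set inside a real job (the numbers here are illustrative only):

```shell
# Stand-in values; inside a real job SLURM exports these itself.
SLURM_NNODES=4            # e.g. from --nodes=4
SLURM_CPUS_PER_TASK=6     # e.g. from --cpus-per-task=6
SLURM_NTASKS=16           # e.g. from --ntasks=16

# OpenMP / hybrid jobs: nodes * cpus-per-task
NPROCS_HYBRID=$(( SLURM_NNODES * SLURM_CPUS_PER_TASK ))

# MPI jobs: the task count itself (-N 4 -n 16 => 16 processors, not 64)
NPROCS_MPI=$SLURM_NTASKS

echo "hybrid: $NPROCS_HYBRID  mpi: $NPROCS_MPI"
```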

SLURM Environment Variables

Slurm Job Environment Variables
Slurm Variable Name    Description                           Example values  PBS/Torque analog
$SLURM_JOB_ID          Job ID                                5741192         $PBS_JOBID
$SLURM_JOBID           Deprecated; same as $SLURM_JOB_ID
$SLURM_JOB_NAME        Job name                              myjob           $PBS_JOBNAME
$SLURM_SUBMIT_DIR      Submit directory                      /data/dune      $PBS_O_WORKDIR
$SLURM_JOB_NODELIST    Nodes assigned to job                 tev01[01-05]    cat $PBS_NODEFILE
$SLURM_SUBMIT_HOST     Host submitted from                   tev.fnal.gov    $PBS_O_HOST
$SLURM_JOB_NUM_NODES   Number of nodes allocated to job      2               $PBS_NUM_NODES
$SLURM_CPUS_ON_NODE    Number of cores/node                  8,3             $PBS_NUM_PPN
$SLURM_NTASKS          Total number of tasks for job         11              $PBS_NP
$SLURM_NODEID          Index of the node running on,         0               $PBS_O_NODENUM
                       relative to nodes assigned to job
$SLURM_LOCALID         Index of the task running on,         4               $PBS_O_VNODENUM
                       within the node
$SLURM_PROCID          Index of task relative to job         0               $PBS_O_TASKNUM - 1
$SLURM_ARRAY_TASK_ID   Job array index                       0               $PBS_ARRAYID
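A job script can record these variables for debugging. A small sketch that also runs outside a job by falling back to a placeholder when a variable is not set:

```shell
#!/bin/sh
# Print the most commonly used SLURM variables; each falls back to
# "unset" when the script is run outside a SLURM job.
JOB_ID="${SLURM_JOB_ID:-unset}"
NODELIST="${SLURM_JOB_NODELIST:-unset}"
SUBMIT_DIR="${SLURM_SUBMIT_DIR:-unset}"
echo "job=${JOB_ID} nodes=${NODELIST} submitted_from=${SUBMIT_DIR}"
```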

More Information

Contact: Amitoj Singh
Last modified: June 7, 2019