Slurm
Slurm is an open-source job scheduler for Linux and Unix-like operating systems.
SLURM Directives
SLURM directives are job options that constrain a job to the conditions specified. Within a batch script they are identified by the syntax `#SBATCH <flag>`; the same flags can also be passed directly to srun on the command line. Commonly used flags are listed below, followed by a short example of combining them.
Resource | Syntax | Example | Description |
---|---|---|---|
Account | --account=<account> | --account=slurmgeneral | Entity to which resources are charged |
Partition | --partition=<partition> | --partition=slurm-general-01 | Partition where job resources are allocated |
Job Name | --job-name=<name> | --job-name=testprogram | Name of the job to be queued |
Tasks | --ntasks=<number> | --ntasks=2 | Number of tasks; useful for commands to be run in parallel |
Memory | --mem=<size>[units] | --mem=1gb | Memory to be allocated for the job |
CPU | --cpus-per-task=<number> | --cpus-per-task=16 | CPUs to be allocated per job task |
Output | --output=<filename> | --output=testprogram.log | Name of the job output file |
Time | --time=<hh:mm:ss> | --time=01:00:00 | Time limit for the job |
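A minimal sketch combining several of the directives above on a single srun command line (myprogram.sh is a placeholder for your own script):

# flags taken from the table above; myprogram.sh is a placeholder
srun --job-name=testprogram --ntasks=2 --cpus-per-task=16 --mem=1gb --time=01:00:00 myprogram.sh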
SRUN
srun is used to submit jobs for execution in real time; it is also used to create job steps within a batch script.
srun example
# run myprogram on a compute node, specifying the partition and account (applicable if assigned multiple accounts)
srun --partition slurm-general-01 --account=slurmgeneral myprogram

# when no partition or account is specified, the defaults are used (the default partition is slurm-general)
srun myprogram.sh
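To get an interactive shell on a compute node instead of running a single program, srun's --pty flag can be used. A minimal sketch (resource flags from the table above can be added as needed):

# request an interactive shell on a compute node
srun --partition slurm-general-01 --account=slurmgeneral --pty bash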
SBATCH
sbatch is the command used to submit jobs to Slurm via batch scripts.
batch script example
#!/bin/bash -l
#SBATCH --job-name=testprogram        # job name
#SBATCH --partition=slurm-general-01  # partition to run the job on; if omitted, the default partition (slurm-general) is used
#SBATCH --account=slurmgeneral        # only applicable if the user is assigned multiple accounts
#SBATCH --ntasks=1                    # number of tasks to run in parallel
#SBATCH --time=10:00:00               # time limit
#SBATCH --mem=1gb                     # request 1gb of memory
#SBATCH --output=testprogram.log      # output and error log

date
sleep 10
python3 someProgram.py
date
submitting a job using sbatch
# queue a job using a batch script; the default partition is used
sbatch myprogram.sh

# specify the partition and account on the command line when they are not set within the batch script
sbatch --partition slurm-general-01 --account=slurmgeneral myprogram.sh
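When sbatch accepts the script it prints the assigned job ID (for example, "Submitted batch job <jobid>"), which can then be used with the status and cancel commands described in the next section. A sketch of the typical follow-up, where <jobid> is a placeholder:

squeue -u cs_HPCuser001   # check the job's status
scancel <jobid>           # cancel the job if needed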
Slurm Command Line
A list of basic Slurm commands and their descriptions.
Command | Syntax | Example | Description |
---|---|---|---|
Job submission | sbatch <filename> | sbatch batch_file.sh | Submit a batch script to Slurm |
Job submission | srun <resource-parameters> <filename> | srun batch_file.sh or srun echo "Testing" | Submit and run a job interactively; also used for job steps within an sbatch script (see the example after this table) |
Job deletion | scancel <job-id> | scancel 100 | Stop or cancel a submitted job |
Job status | squeue -u <username> | squeue -u cs_HPCuser001 | Check job status by user |
Cluster resources | sinfo | sinfo | Current status of compute nodes within the HPC cluster |
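The srun row above mentions job steps; a minimal sketch of that pattern inside a batch script (the Python program is a placeholder):

#!/bin/bash -l
#SBATCH --job-name=steps-example
#SBATCH --ntasks=2

srun --ntasks=1 python3 someProgram.py &   # job step 1
srun --ntasks=1 python3 someProgram.py &   # job step 2, started in parallel
wait                                       # wait for both steps to finish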
Compute Resources
The ODU CS department HPC cluster comprises multiple partitions to which users can submit jobs. Each partition can only be accessed by users who are assigned to the partition's respective account, so not all partitions are available to all users.
Cluster | Partition | Node(s) | Account(s) |
---|---|---|---|
slurm-cluster | slurm-general | slurm-a40-collab, slurm-a40-collab-2, slurm-p100-collab | slurmgeneral, shaoresearch, fwangresearch |
slurm-cluster | lusiliresearch | slurm-a40-collab, slurm-a40-collab-2, slurm-p100-collab | slurmgeneral, shaoresearch, fwangresearch |
slurm-cluster | haoresearch | slurm-a6000-hao | shaoresearch |
slurm-cluster | wangresearch | slurm-a6000-wang | fwangresearch |
Node | GPU(s) | CPU(s) | RAM |
---|---|---|---|
slurm-a40-collab | Nvidia A40 (8) | 96 | 375GB |
slurm-a40-collab-2 | Nvidia A40 (7) | 96 | 500GB |
slurm-p100-collab | Nvidia P100 (4) | 40 | 181GB |
slurm-a6000-hao | Nvidia A6000 (4) | 96 | 515GB |
slurm-a6000-wang | Nvidia A6000 (8) | 96 | 385GB |
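The nodes above provide GPUs. Depending on how the cluster is configured, a job may need to request them explicitly; the standard Slurm way to do that is the --gres flag. A hedged sketch, assuming GPUs are exposed as a generic resource named gpu:

# request one GPU alongside CPUs and memory, then list the visible GPUs
srun --partition slurm-general-01 --gres=gpu:1 --cpus-per-task=16 --mem=50GB nvidia-smi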
Accessing the HPC Cluster
To gain access to the HPC cluster, email root@cs.odu.edu. After you have been given access, you can connect to the cluster by issuing the command below.
ssh cs_username001@slurm-manager.cs.odu.edu
Jupyter Notebook
Jupyter Notebook is available for use via an Apptainer container. Containers are located in /mnt/apptainer/. Below is an example of how to launch Jupyter Notebook.
cs_user001@slurm-manager:~$ srun --cpus-per-task=16 --mem=50GB apptainer run --nv /mnt/apptainer/jupyter-gpu.sif
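Because the notebook runs on a compute node, its web interface usually has to be forwarded to your local machine before you can open it in a browser. A hedged sketch using SSH local port forwarding, assuming the notebook listens on Jupyter's default port 8888 and the job was placed on slurm-a40-collab:

# forward local port 8888 to the notebook's port on the compute node, then browse to http://localhost:8888
ssh -L 8888:slurm-a40-collab:8888 cs_username001@slurm-manager.cs.odu.edu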
Troubleshooting
How to view your assigned account(s)
sacctmgr show association -p user=$username
Errors regarding the sbatch file are often caused by the batch script not being executable; make it executable with:
chmod +x sbatch_filename.sh