Difference between revisions of "Slurm"

Revision as of 15:22, 11 April 2023

Slurm is an open-source job scheduler for Linux and Unix-like kernels.

SLURM Directives

SLURM directives are job options that constrain the job to the conditions specified. In a batch script submitted with sbatch, each directive is written as `#SBATCH <flag>`; the same flags can also be passed directly on the srun or sbatch command line. Commonly used flags are listed below.

Resource	Syntax	Example	Description
Account	--account=<account>	--account=slurmgeneral	account to which resources are charged; see Compute Resources for available accounts
Partition	--partition=<partition>	--partition=slurm-general-01	partition where job resources are allocated; see Compute Resources for available partitions
Job Name	--job-name=<name>	--job-name=testprogram	name of the job to be queued
Task	--ntasks=<number>	--ntasks=2	number of tasks; useful for commands to be run in parallel
Memory	--mem=<size>[units]	--mem=1gb	memory to be allocated for the job
CPU	--cpus-per-task=<number>	--cpus-per-task=16	CPUs to be allocated per job task
Output	--output=<filename>	--output=testprogram.log	name of the job output file
Time	--time=<hh:mm:ss>	--time=01:00:00	time limit for the job
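
Because every directive in a batch script starts with `#SBATCH`, the options a script requests can be listed with a plain text search. The following is a minimal sketch, not part of the cluster tooling: the file name `directive_demo.sh` and the helper `list_directives` are hypothetical, and the script runs without Slurm installed.

```shell
#!/bin/bash
# Write a small sample batch script; the file name is a hypothetical example.
cat > directive_demo.sh <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=testprogram
#SBATCH --ntasks=2
#SBATCH --mem=1gb
#SBATCH --time=01:00:00

echo "hello from the job"
EOF

# List the options a script requests: keep lines starting with "#SBATCH"
# and strip the "#SBATCH " prefix so only the flags remain.
list_directives() {
    grep '^#SBATCH' "$1" | sed 's/^#SBATCH[[:space:]]*//'
}

list_directives directive_demo.sh
```

Running it prints one flag per line, which is a quick way to review what a script will ask for before submitting it.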

SRUN

srun submits a job for execution in real time. It is also used to create job steps within a batch script.

srun example

srun --partition=slurm-general-01 --account=slurmgeneral myprogram           # run myprogram on a compute node,
                                                                             # specifying the partition and account (the account is only needed if you are assigned multiple accounts)

srun myprogram.sh                                                            # run myprogram.sh on a compute node
                                                                             # the default partition and account are used when not specified
                                                                             # the default partition is slurm-general

SBATCH

sbatch is the command used to submit jobs to SLURM via batch scripts.

batch script example

#!/bin/bash -l                                                          
#SBATCH --job-name=testprogram             # job name
#SBATCH --partition=slurm-general-01       # specifying which partition to run job on, if omitted default partition will be used (slurm-general)
#SBATCH --account=slurmgeneral             # only applicable if user is assigned multiple accounts
#SBATCH --ntasks=1                         # number of tasks (commands to run in parallel)
#SBATCH --time=10:00:00                    # time limit (hh:mm:ss)
#SBATCH --mem=1gb                          # request 1gb of memory
#SBATCH --output=testprogram.log           # output and error log

date
sleep 10
python3 someProgram.py
date
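
Inside a batch script, each `srun` call starts a job step within the job's allocation. The sketch below illustrates this under a few assumptions: the file name `steps_demo.sh`, the job name, and the commands run in each step are placeholders, and the partition and account lines are omitted so the defaults apply. It only writes the script and checks its syntax, so it can be tried without Slurm.

```shell
#!/bin/bash
# Write a batch script that runs two job steps; the file name is hypothetical.
cat > steps_demo.sh <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=stepsdemo
#SBATCH --ntasks=2
#SBATCH --time=00:10:00
#SBATCH --mem=1gb
#SBATCH --output=stepsdemo.log

# Step 1: run hostname once per task (two copies with --ntasks=2).
srun hostname

# Step 2: run a single task inside the same allocation.
srun --ntasks=1 date
EOF

# Check the script for shell syntax errors without executing it.
bash -n steps_demo.sh && echo "syntax ok"

# To queue it on the cluster:
#   sbatch steps_demo.sh
```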

submitting a job using sbatch

sbatch myprogram.sh                                                      # queue job using a batch script. Default partition will be used.

sbatch --partition=slurm-general-01 --account=slurmgeneral myprogram.sh  # specify the partition and account on the command line when they are not set within the batch script
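
When a submission is scripted, it helps to capture the job ID so it can be passed to `squeue` or `scancel`. The sketch below assumes the `--parsable` option of sbatch, which prints the job ID alone or as `<jobid>;<cluster>`; the helper names `parse_job_id` and `submit_job` are hypothetical, and the submission only runs where sbatch is installed.

```shell
#!/bin/bash
# Extract the job ID from "sbatch --parsable" output, which is either
# "<jobid>" or "<jobid>;<cluster>".
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# Submit a script (plus any extra sbatch options) and print its job ID.
submit_job() {
    local out
    out="$(sbatch --parsable "$@")" || return 1
    parse_job_id "$out"
}

if command -v sbatch >/dev/null 2>&1; then
    # Submit, then show the job's status in the queue.
    job_id="$(submit_job myprogram.sh)" && squeue -j "$job_id"
else
    echo "sbatch not found; run this after connecting to the cluster"
fi
```

The same ID can later be passed to `scancel` to stop the job.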

Slurm Command Line

Basic Slurm commands and their descriptions.

Command	Syntax	Example	Description
Job submission	sbatch <filename>	sbatch batch_file.sh	Submit a batch script to Slurm
Job submission	srun <resource-parameters> <filename>	srun batch_file.sh / srun echo "Testing"	Submit and run a job interactively. Also used for job steps within an sbatch script
Job Deletion	scancel <job-id>	scancel 100	Stop or cancel submitted job
Job status	squeue -u <username>	squeue -u cs_HPCuser001	Check job status by user
Cluster resources	sinfo	sinfo	Current status about compute nodes within the HPC cluster

Compute Resources

The ODU CS department HPC cluster consists of multiple partitions to which users can submit jobs. A partition can only be accessed by users who are assigned to that partition's account, so not all partitions are accessible to all users.

Resources
Cluster	Partition	Node(s)	Account(s)
slurm-cluster	slurm-general	slurm-a40-collab, slurm-a40-collab-2, slurm-p100-collab	slurmgeneral, shaoresearch, fwangresearch
slurm-cluster	lusiliresearch	slurm-a40-collab, slurm-a40-collab-2, slurm-p100-collab	slurmgeneral, shaoresearch, fwangresearch
slurm-cluster	haoresearch	slurm-a6000-hao	shaoresearch
slurm-cluster	wangresearch	slurm-a6000-wang	fwangresearch

Node Hardware
Node	GPU(s)	CPU(s)	RAM
slurm-a40-collab	Nvidia A40 (8)	96	375GB
slurm-a40-collab-2	Nvidia A40 (7)	96	500GB
slurm-p100-collab	Nvidia P100 (4)	40	181GB
slurm-a6000-hao	Nvidia A6000 (4)	96	515GB
slurm-a6000-wang	Nvidia A6000 (8)	96	385GB
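
The tables above are a snapshot; the live values can be read from `sinfo`. The sketch below uses the `sinfo` format specifiers `%N` (node), `%P` (partition), `%c` (CPUs per node), `%m` (memory in MB) and `%G` (generic resources). Whether GPUs appear in the `%G` column depends on how the cluster is configured, which is an assumption here, and the helper name `show_nodes` is hypothetical.

```shell
#!/bin/bash
# Print one line per node: name, partition, CPU count, memory (MB), and
# generic resources (GPUs appear only if the site configures them as GRES).
show_nodes() {
    sinfo -N -o "%N %P %c %m %G"
}

if command -v sinfo >/dev/null 2>&1; then
    show_nodes
else
    echo "sinfo not found; run this after connecting to the cluster"
fi
```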

Accessing the HPC Cluster

To gain access to the HPC cluster, email root@cs.odu.edu. After you have been given access, you can connect to the HPC cluster by issuing the command below.

ssh cs_username001@slurm-manager.cs.odu.edu

Jupyter Notebook

Jupyter Notebook is available via an Apptainer container. Containers are located in /mnt/apptainer/. Below is an example of how to launch Jupyter Notebook.

cs_user001@slurm-manager:~$ srun --cpus-per-task=16 --mem=50GB apptainer run --nv /mnt/apptainer/jupyter-gpu.sif

Troubleshooting

How to view assigned account

sacctmgr show association -p user=$USER

Errors regarding SBATCH file

chmod +x sbatch_filename.sh
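
If the error is a "Permission denied" message, a missing execute bit is one likely cause, and the `chmod` command above addresses it. The sketch below checks for the bit before changing it; the helper name `ensure_executable` and the demo file `sbatch_demo.sh` are hypothetical.

```shell
#!/bin/bash
# Add the execute bit only if it is missing, and report what was done.
ensure_executable() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    if [ -x "$file" ]; then
        echo "$file is already executable"
    else
        chmod +x "$file" && echo "made $file executable"
    fi
}

# Demo on a throwaway script with the execute bit cleared.
printf '#!/bin/bash\necho hello\n' > sbatch_demo.sh
chmod a-x sbatch_demo.sh
ensure_executable sbatch_demo.sh
```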

