
Slurm is an open-source job scheduler for Linux and Unix-like operating systems.

== SLURM Directives ==

SLURM directives are job options that constrain the job to the specified conditions. Directives are identified by the syntax `#SBATCH <flag>` and are placed at the top of a batch script. Commonly used flags are listed below; the same flags can also be passed on the command line, as shown after the table.

{| class="wikitable"
|+ Flags
|-
! Resource !! Syntax !! Example !! Description
|-
| Account || --account=<account> || --account=slurmgeneral || Entity to which resources are charged; available accounts are listed under Compute Resources
|-
| Partition || --partition=<partition> || --partition=slurm-general-01 || Where job resources are allocated; available partitions are listed under Compute Resources
|-
| Job Name || --job-name=<name> || --job-name=testprogram || Name of the job to be queued
|-
| Tasks || --ntasks=<number> || --ntasks=2 || Number of tasks; useful for commands to be run in parallel
|-
| Memory || --mem=<size>[units] || --mem=1gb || Memory to be allocated for the job
|-
| Output || --output=<filename> || --output=testprogram.log || Name of the job output file
|-
| Time || --time=<hh:mm:ss> || --time=01:00:00 || Time limit for the job
|}
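
The same flags can also be passed to sbatch on the command line, where they override any corresponding directives inside the script (standard Slurm behavior; myprogram.sh is a placeholder name):

sbatch --job-name=testprogram --mem=1gb --time=01:00:00 myprogram.sh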

== SRUN ==

srun is used to submit jobs for execution in real time. It is also used to create job steps within a batch script.


srun example

srun --partition slurm-general-01 --account=slurmgeneral myprogram.sh    # run on a compute node, specifying the partition and account
                                                                         # (applicable if the user is assigned multiple accounts)
srun myprogram.sh                                                        # run on a compute node
                                                                         # the default partition and account are used when not specified
                                                                         # (the default partition is slurm-general)
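
srun can also allocate an interactive shell on a compute node. This is a generic Slurm pattern (not site-specific) using srun's --pty option:

srun --partition slurm-general-01 --pty bash    # interactive shell on a compute node; type exit to release the allocation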

== SBATCH ==

sbatch is the command used to submit jobs to SLURM via batch scripts. The script is queued and runs once the requested resources become available.


batch script example

#!/bin/bash -l
#SBATCH --job-name=testprogram             # job name
#SBATCH --partition=slurm-general-01       # partition to run the job on; if omitted, the default partition (slurm-general) is used
#SBATCH --account=slurmgeneral             # only applicable if the user is assigned multiple accounts
#SBATCH --ntasks=1                         # number of tasks to run in parallel
#SBATCH --time=10:00:00                    # time limit of 10 hours
#SBATCH --mem=1gb                          # request 1gb of memory
#SBATCH --output=testprogram.log           # output and error log

date
sleep 10
module use /mnt/lmod_modules/Linux/
module load miniconda3
python someProgram.py                      # run the program with the python provided by miniconda3
date


submitting a job using sbatch

sbatch myprogram.sh                                                      # queue a job using a batch script; the default partition is used
sbatch --partition slurm-general-01 --account=slurmgeneral myprogram.sh  # specify the partition and account on the command line when they are not set within the batch script
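
On success, sbatch prints the ID assigned to the job (the ID shown below is illustrative); this ID is used by the commands in the next section:

cs_user001@slurm-manager:~$ sbatch myprogram.sh
Submitted batch job 1234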


== Slurm Command Line ==

A list of basic Slurm commands and their descriptions.

{| class="wikitable"
|-
! Command !! Syntax !! Example !! Description
|-
| Job submission || sbatch <filename> || sbatch batch_file.sh || Submit a batch script to Slurm
|-
| Job submission || srun <resource-parameters> <filename> || srun batch_file.sh / srun echo "Testing" || Submit and run a job interactively; also used for job steps within an sbatch script
|-
| Job deletion || scancel <job-id> || scancel 100 || Stop or cancel a submitted job
|-
| Job status || squeue -u <username> || squeue -u cs_HPCuser001 || Check job status by user
|-
| Cluster resources || sinfo || sinfo || Current status of the compute nodes within the HPC cluster
|}
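
For example, after submitting a job, squeue shows it in the queue and scancel removes it (the job ID 1234 is illustrative):

squeue -u cs_HPCuser001        # list the user's pending and running jobs
scancel 1234                   # cancel the job with ID 1234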

== Compute Resources ==

The ODU CS department HPC cluster is composed of multiple partitions to which users can submit jobs. Each partition can only be accessed by users who are assigned to the partition's respective account; not all partitions are accessible to all users.

{| class="wikitable"
|+ Resources
|-
! Cluster !! Partition !! Account
|-
| slurm-cluster || slurm-general || slurmgeneral
|-
| slurm-cluster || haoresearch || shaoresearch
|-
| slurm-cluster || lusiliresearch || lliresearch
|-
| slurm-cluster || wangresearch || fwangresearch
|}
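
For example, a user assigned to the shaoresearch account would submit a job to its partition as follows (myprogram.sh is a placeholder name):

sbatch --partition haoresearch --account=shaoresearch myprogram.sh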

== Accessing the HPC Cluster ==

To gain access to the HPC cluster, email root@cs.odu.edu. After you have been given access, issue the command below to connect to the HPC cluster.

ssh cs_username001@slurm-manager.cs.odu.edu

== Jupyter Notebook ==

Jupyter Notebook is available for use via an Apptainer container. Containers are located in /mnt/apptainer/. Below is an example of how to launch Jupyter Notebook with a batch script.

cs_user001@slurm-manager:~$ nano jupyter-nb.sh

#!/bin/bash -l
#SBATCH --job-name=jupyter-nb              # job name
#SBATCH --ntasks=1                         # number of tasks to run in parallel
#SBATCH --mem=8gb                          # request 8gb of memory
#SBATCH --gres=gpu:4                       # request 4 GPUs
#SBATCH --time=48:00:00                    # job runs for up to 48 hours
#SBATCH --output=jupyter-nb.log            # output and error log

apptainer run --nv /mnt/apptainer/jupyter-gpu.sif   # --nv makes the host GPUs available inside the container


Make the script executable:

chmod +x jupyter-nb.sh


To launch Jupyter Notebook, submit the script with sbatch. (srun does not parse #SBATCH directives, so the resource requests above would be ignored if the script were started with srun.) The notebook's connection URL and token are written to jupyter-nb.log.

cs_user001@slurm-manager:~$ sbatch jupyter-nb.sh
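
The notebook listens on the compute node, not on your workstation, so an SSH tunnel is typically required to reach it. A sketch, assuming Jupyter's default port 8888 and a job running on a node named <compute-node> (visible in the squeue NODELIST column):

ssh -L 8888:<compute-node>:8888 cs_username001@slurm-manager.cs.odu.edu   # then browse to http://localhost:8888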

== Troubleshooting ==

How to view your assigned accounts:

sacctmgr show association -p user=<username>

If a submitted batch script fails with a permissions error, make it executable:

chmod +x sbatch_filename.sh