Slurm

Slurm is an open-source job scheduler for Linux and Unix-like kernels.


Quick Start Guide

Here are the steps you need to take to get started and practice using the Computer Science Dept. HPC cluster.

When following along with this guide, please replace every instance of "cs_username001" with your CS account name.


1. Log in to slurm-manager

To gain access to the HPC cluster, email root@cs.odu.edu. Once you have been given access, you can connect to the cluster by issuing the command below. Note: if you are accessing the cluster from a computer that is not on the CS network, you must be connected to the CS VPN.

ssh cs_username001@slurm-manager.cs.odu.edu

When you are granted access to the HPC cluster, you are also given a research share located at /mnt/hpc_projects/cs_username001/.

You are encouraged to use this share for all of your projects: it is much larger than your regular home directory and gives you room to install the libraries your projects require.


2. Load anaconda into your current environment

module use /mnt/lmod_modules/Linux/
module load miniconda3 

3. Use anaconda to manage your virtual environments

Now that you have access to anaconda, you can easily and efficiently manage your own libraries for your projects. If you are unfamiliar with anaconda, you will find the official documentation (https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html) very helpful.

For this example, we will create a new virtual environment called quick_start and activate it by using the following commands.

conda create -p /mnt/hpc_projects/cs_username001/.envs/quick_start numpy==1.23.4 tensorflow-gpu
conda activate /mnt/hpc_projects/cs_username001/.envs/quick_start

4. Run your project

First, create a directory for this example project and change into it.

mkdir Quick_Start
cd Quick_Start/

Then, create the following two files in your project directory.

gpu_test.py

import tensorflow as tf

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

test_program.sh

#!/bin/bash -l
#SBATCH --job-name=gpu_test                # job name
#SBATCH --partition=slurm-general          # specifying which partition to run job on, if omitted default partition will be used (slurm-general)
#SBATCH --account=slurmgeneral             # only applicable if user is assigned multiple accounts
#SBATCH --ntasks=1                         # commands to run in parallel
#SBATCH --time=1:00:00                     # time limit on the job
#SBATCH --mem=1gb                          # request 1gb of memory
#SBATCH --output=gpu_test.log              # output and error log

date
python3 gpu_test.py

Next, make test_program.sh executable.

chmod +x test_program.sh

Finally, submit your batch script to slurm.

sbatch test_program.sh

You can view the status of your job with this command.

squeue -u cs_username001

Once your job is complete, a new file named gpu_test.log will appear in your project directory. Its contents should show TensorFlow identifying any available GPUs.
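
If the log instead reports that no GPU was found, or that TensorFlow cannot be imported, the quick_start environment from step 3 may not be active inside the batch job. Below is a minimal sketch of the same batch script with the module and environment loaded explicitly; the paths are the ones from steps 2 and 3 above.

#!/bin/bash -l
#SBATCH --job-name=gpu_test                # job name
#SBATCH --partition=slurm-general          # partition to run the job on
#SBATCH --account=slurmgeneral             # only applicable if user is assigned multiple accounts
#SBATCH --ntasks=1                         # commands to run in parallel
#SBATCH --time=1:00:00                     # time limit on the job
#SBATCH --mem=1gb                          # request 1gb of memory
#SBATCH --output=gpu_test.log              # output and error log

# make miniconda3 available inside the job and activate the environment created in step 3
module use /mnt/lmod_modules/Linux/
module load miniconda3
conda activate /mnt/hpc_projects/cs_username001/.envs/quick_start

date
python3 gpu_test.py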

If you have any issues with the GPU cluster itself, please email root@cs.odu.edu with any errors you have received so that we may resolve the issue. Please note that your code and the Python libraries used in your project are not managed by us; while we may be able to point you in the right direction for some of those issues, it is ultimately up to you to resolve problems specific to your project.

Compute Resources

The ODU CS department HPC cluster consists of multiple partitions to which users can submit jobs. Each partition can only be accessed by users who are assigned to that partition's account, so not all partitions are available to all users.

{| class="wikitable"
|+ Resources
|-
! Cluster !! Partition !! Node(s) !! Account(s)
|-
| slurm-cluster || slurm-general || slurm-a40-collab, slurm-p100-collab || slurmgeneral, shaoresearch, fwangresearch
|-
| slurm-cluster || lusiliresearch || slurm-a40-collab, slurm-p100-collab || slurmgeneral, shaoresearch, fwangresearch
|-
| slurm-cluster || haoresearch || slurm-a6000-hao || shaoresearch
|-
| slurm-cluster || wangresearch || slurm-a6000-wang || fwangresearch
|}
{| class="wikitable"
|+ Node Hardware
|-
! Node !! GPU(s) !! CPU(s) !! RAM
|-
| slurm-a40-collab || Nvidia A40 (8) || 96 || 375GB
|-
| slurm-a40-collab-2 || Nvidia A40 (7) || 96 || 500GB
|-
| slurm-p100-collab || Nvidia P100 (4) || 40 || 181GB
|-
| slurm-a6000-hao || Nvidia A6000 (4) || 96 || 515GB
|-
| slurm-a6000-wang || Nvidia A6000 (8) || 96 || 385GB
|}
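
To check which of these partitions and nodes are currently available, and which accounts you are assigned, you can query the cluster from slurm-manager. A brief sketch using the sinfo and sacctmgr commands described later on this page:

sinfo                                               # list all partitions and the state of their nodes
sinfo -p slurm-general                              # limit the listing to a single partition
sacctmgr show association -p user=cs_username001    # show the accounts your user is assigned to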

SLURM Directives

SLURM directives are job options that constrain the job to the conditions specified. Directives can be identified by the syntax `#SBATCH <flag>`. These directives can be used with srun and within a batch script (sbatch). Commonly used flags are listed below.

{| class="wikitable"
|+ Flags
|-
! Resource !! Syntax !! Example !! Description
|-
| Account || --account=<account> || --account=slurmgeneral || entity to which the job's resource usage is charged (see the accounts listed under Compute Resources)
|-
| Partition || --partition=<partition> || --partition=slurm-general || where job resources are allocated (see the partitions listed under Compute Resources)
|-
| Job Name || --job-name=<name> || --job-name=testprogram || name of the job to be queued
|-
| Task || --ntasks=<number> || --ntasks=2 || number of tasks; useful for commands to be run in parallel
|-
| Memory || --mem=<size>[units] || --mem=1gb || memory to be allocated for the job
|-
| CPU || --cpus-per-task=<number> || --cpus-per-task=16 || CPUs to be allocated per job task
|-
| Output || --output=<filename> || --output=testprogram.log || name of the job output file
|-
| Time || --time=<time> || --time=01:00:00 || time limit for the job
|}
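
The same flags can be combined on a single srun command line. A minimal sketch using the general partition and account from the tables above; the job simply prints the hostname of the compute node it lands on:

srun --partition=slurm-general --account=slurmgeneral \
     --job-name=flag_demo --ntasks=1 --cpus-per-task=4 \
     --mem=2gb --time=00:10:00 --output=flag_demo.log hostname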

SRUN

srun is used to submit jobs for execution in real time. It is also used to create job steps within a batch script.


srun example

srun --partition=slurm-general --account=slurmgeneral myprogram              # run myprogram on a compute node,
                                                                             # specifying which partition and account (applicable if assigned multiple accounts)
srun myprogram.sh                                                            # run myprogram.sh on a compute node;
                                                                             # the default partition and account are used when not specified
                                                                             # (the default partition is slurm-general)
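
srun can also start an interactive shell on a compute node, which is handy for short tests and debugging. A minimal sketch, again specifying the general partition and account:

srun --partition=slurm-general --account=slurmgeneral --pty /bin/bash        # interactive shell on a compute node
exit                                                                         # leave the shell to release the allocation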

SBATCH

sbatch is the command used to submit jobs to SLURM via batch scripts.


batch script example

#!/bin/bash -l                                                          
#SBATCH --job-name=testprogram             # job name
#SBATCH --partition=slurm-general          # specifying which partition to run job on; if omitted, the default partition (slurm-general) will be used
#SBATCH --account=slurmgeneral             # only applicable if user is assigned multiple accounts
#SBATCH --ntasks=1                         # commands to run in parallel
#SBATCH --time=10:00:00                    # time limit on the job (10 hours)
#SBATCH --mem=1gb                          # request 1gb of memory
#SBATCH --output=testprogram.log           # output and error log

date
sleep 10
python3 someProgram.py
date


submitting a job using sbatch

sbatch myprogram.sh                                                      # queue job using a batch script. Default partition will be used. 
sbatch --partition=slurm-general --account=slurmgeneral myprogram.sh     # specify the partition and account on the command line when they are not set within the batch script
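
Since srun also creates job steps (see SRUN above), a batch script can contain several srun calls, each recorded as a separate step of the same job. A minimal sketch, assuming the general partition and account:

#!/bin/bash -l
#SBATCH --job-name=steps_demo
#SBATCH --partition=slurm-general
#SBATCH --account=slurmgeneral
#SBATCH --ntasks=1
#SBATCH --output=steps_demo.log

srun echo "step 1"        # first job step
srun echo "step 2"        # second job step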

Slurm Command Line

A list of basic Slurm commands and their descriptions.

{| class="wikitable"
|-
! Command !! Syntax !! Example !! Description
|-
| Job submission || sbatch <filename> || sbatch batch_file.sh || Submit a batch script to Slurm
|-
| Job submission || srun <resource-parameters> <filename> || srun batch_file.sh (or srun echo "Testing") || Submit and run a job interactively. Also used for job steps within an sbatch script
|-
| Job Deletion || scancel <job-id> || scancel 100 || Stop or cancel a submitted job
|-
| Job status || squeue -u <username> || squeue -u cs_username001 || Check job status by user
|-
| Cluster resources || sinfo || sinfo || Current status of the compute nodes within the HPC cluster
|}
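
Put together, a typical workflow with these commands looks like the following (sbatch prints the job ID when the job is accepted):

sbatch test_program.sh            # prints "Submitted batch job <jobid>"
squeue -u cs_username001          # check whether the job is pending or running
scancel <jobid>                   # cancel the job if it is no longer needed
sinfo                             # see the overall state of the cluster's nodes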



Jupyter Notebook

Jupyter Notebook is available via an Apptainer container. Containers are located in /mnt/apptainer/. Below is an example of how to launch Jupyter Notebook.

cs_user001@slurm-manager:~$ srun --cpus-per-task=16 --mem=50GB apptainer run --nv /mnt/apptainer/jupyter-gpu.sif
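
The notebook itself runs on whichever compute node the job is assigned to, so to open it in a local browser you will generally need an SSH tunnel through slurm-manager. A hedged sketch, assuming the container serves Jupyter on its default port 8888 and that squeue shows your job running on slurm-a40-collab:

# run on your own machine (connected to the CS VPN if off-network)
ssh -L 8888:slurm-a40-collab:8888 cs_username001@slurm-manager.cs.odu.edu
# then browse to http://localhost:8888 and use the token/URL printed in the srun output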



Troubleshooting

How to view assigned account

sacctmgr show association -p user=$username

Errors when submitting an SBATCH file: make sure the script is executable.

chmod +x sbatch_filename.sh