Slurm
Slurm is an open-source job scheduler for Linux and other Unix-like operating systems.
Quick Start Guide
These are the steps you need to follow to get started with the Computer Science Dept. HPC cluster.
- Log in to slurm-manager:

  ```
  ssh <username>@slurm-manager.cs.odu.edu
  ```

  Replace `<username>` with your CS username. To gain access to the HPC cluster, email root@cs.odu.edu. If you are not within the CS network, you must be connected to the CS VPN.
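  If you log in often, an alias in your local SSH config saves typing. A minimal sketch, assuming a client-side ~/.ssh/config file; "cs-hpc" is a hypothetical alias name:

  ```
  # ~/.ssh/config -- "cs-hpc" is a hypothetical alias; choose any name
  Host cs-hpc
      HostName slurm-manager.cs.odu.edu
      User <username>
  ```

  With this entry in place, `ssh cs-hpc` is equivalent to the full command above.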
- Load Anaconda into your current environment:

  ```
  module use /mnt/lmod_modules/Linux/
  module load miniconda3
  ```
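  To verify the module loaded into your session, the standard Lmod commands below should work here as well:

  ```
  module avail    # list modules available from the added path
  module list     # list currently loaded modules; miniconda3 should appear
  ```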
- Use Anaconda to manage your virtual environments:

  ```
  conda create -p /mnt/hpc_projects/<username>/.envs/quick_start numpy==1.23.4 tensorflow-gpu
  ```

  Again, replace `<username>` with your CS username. To get familiar with Anaconda, see the official documentation.
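  Before using the environment, activate it by the same prefix path passed to `conda create -p`; this is standard conda usage, though on some setups `source activate` may be needed instead:

  ```
  conda activate /mnt/hpc_projects/<username>/.envs/quick_start
  ```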
- Create a directory for your project:

  ```
  mkdir <name>
  cd <name>
  ```

  Replace `<name>` with the name you want the folder to have.
- Create and edit `gpu_test.py`:

  ```python
  import tensorflow as tf

  # Print the default GPU device if TensorFlow can see one
  if tf.test.gpu_device_name():
      print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
  else:
      print("Please install GPU version of TF")
  ```
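  Note that `tf.test.gpu_device_name()` is deprecated in recent TensorFlow 2.x releases. If the pinned versions above give you a newer TF, an equivalent check (a sketch, not part of the original guide) uses `tf.config.list_physical_devices`:

  ```python
  import tensorflow as tf

  # Returns one PhysicalDevice entry per GPU that TensorFlow can see
  gpus = tf.config.list_physical_devices('GPU')
  if gpus:
      print('GPUs available: {}'.format(gpus))
  else:
      print("Please install GPU version of TF")
  ```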
- Create and edit `test_program.sh`:

  ```bash
  #!/bin/bash -l
  #SBATCH --job-name=gpu_test       # job name
  #SBATCH --partition=slurm-general # partition to run the job on; if omitted, the default partition (slurm-general) is used
  #SBATCH --account=slurmgeneral    # only applicable if user is assigned multiple accounts
  #SBATCH --ntasks=1                # number of tasks to run in parallel
  #SBATCH --time=1:00:00            # time limit on the job
  #SBATCH --mem=1gb                 # request 1 GB of memory
  #SBATCH --output=gpu_test.log     # output and error log

  date
  python3 gpu_test.py
  ```
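  Note that this script never explicitly asks for a GPU. On many Slurm clusters, GPUs are generic resources (GRES) that must be requested per job; whether that applies to slurm-general is an assumption to confirm with the cluster administrators:

  ```bash
  #SBATCH --gres=gpu:1    # request one GPU (assumption: GPUs are configured as a GRES here)
  ```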
- Make test_program.sh executable:

  ```
  chmod +x test_program.sh
  ```
- Submit your batch script to Slurm:

  ```
  sbatch test_program.sh
  ```
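  On success, sbatch prints the ID assigned to your job; the number below is illustrative:

  ```
  Submitted batch job 12345
  ```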
- You can view the status of the job with the following:

  ```
  squeue -u <username>
  ```

  Replace `<username>` with your username.
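  The output has one row per pending or running job; all values below are illustrative:

  ```
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345 slurm-ge  gpu_test  cs_user  R       0:42      1 node01
  ```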
Once your job is complete, a new file named gpu_test.log will appear in your project directory. Its contents should show TensorFlow identifying any available GPUs.
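If the job ran on a GPU node, gpu_test.log should contain the output of `date` followed by the script's print statement, along these lines (timestamp and device name illustrative):

```
Mon Jan  1 12:00:00 EST 2024
Default GPU Device: /device:GPU:0
```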
If you have any issues with the GPU cluster itself, please send an email to root@cs.odu.edu with any errors you have received so that we may resolve the issue. Please note that your code and the Python libraries used in your project are not managed by us. While we may be able to point you in the right direction for some of those issues, it is ultimately up to you to resolve issues pertaining to your project.