Slurm
Slurm is an open-source job scheduler for Linux and other Unix-like operating systems.
Quick Start Guide
These are the steps you need to follow to get started with the Computer Science Dept. HPC cluster.
- Log in to slurm-manager:

  ```
  ssh <username>@slurm-manager.cs.odu.edu
  ```

  Replace `<username>` with your CS username. To gain access to the HPC cluster, email root@cs.odu.edu. If you are not within the CS network, you must be connected to the CS VPN.
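  If you log in often, an alias in your local SSH config saves typing. A minimal sketch, assuming a client-side ~/.ssh/config file; "cs-hpc" is a hypothetical alias name:

  ```
  # ~/.ssh/config -- "cs-hpc" is a hypothetical alias; choose any name
  Host cs-hpc
      HostName slurm-manager.cs.odu.edu
      User <username>
  ```

  With this entry in place, `ssh cs-hpc` is equivalent to the full command above.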
- Load Anaconda into your current environment:

  ```
  module use /mnt/lmod_modules/Linux/
  module load miniconda3
  ```
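  To verify the module loaded into your session, the standard Lmod commands below should work here as well:

  ```
  module avail    # list modules available from the added path
  module list     # list currently loaded modules; miniconda3 should appear
  ```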
- Use Anaconda to manage your virtual environments:

  ```
  conda create -p /mnt/hpc_projects/<username>/.envs/quick_start numpy==1.23.4 tensorflow-gpu
  ```

  Again, replace `<username>` with your CS username. To get familiar with Anaconda, see the official documentation.
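  Before using the environment, activate it by the same prefix path passed to `conda create -p`; this is standard conda usage, though on some setups `source activate` may be needed instead:

  ```
  conda activate /mnt/hpc_projects/<username>/.envs/quick_start
  ```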
- Create a directory for your project:

  ```
  mkdir <name>
  cd <name>
  ```

  Replace `<name>` with the name you want the folder to have.
- Create and edit `gpu_test.py`:

  ```python
  import tensorflow as tf

  # Print the default GPU device if TensorFlow can see one
  if tf.test.gpu_device_name():
      print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
  else:
      print("Please install GPU version of TF")
  ```
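  Note that `tf.test.gpu_device_name()` is deprecated in recent TensorFlow 2.x releases. If the pinned versions above give you a newer TF, an equivalent check (a sketch, not part of the original guide) uses `tf.config.list_physical_devices`:

  ```python
  import tensorflow as tf

  # Returns one PhysicalDevice entry per GPU that TensorFlow can see
  gpus = tf.config.list_physical_devices('GPU')
  if gpus:
      print('GPUs available: {}'.format(gpus))
  else:
      print("Please install GPU version of TF")
  ```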
- Create and edit `test_program.sh`:

  ```bash
  #!/bin/bash -l
  #SBATCH --job-name=gpu_test       # job name
  #SBATCH --partition=slurm-general # partition to run the job on; if omitted, the default partition (slurm-general) is used
  #SBATCH --account=slurmgeneral    # only applicable if user is assigned multiple accounts
  #SBATCH --ntasks=1                # number of tasks to run in parallel
  #SBATCH --time=1:00:00            # time limit on the job
  #SBATCH --mem=1gb                 # request 1 GB of memory
  #SBATCH --output=gpu_test.log     # output and error log

  date
  python3 gpu_test.py
  ```
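  Note that this script never explicitly asks for a GPU. On many Slurm clusters, GPUs are generic resources (GRES) that must be requested per job; whether that applies to slurm-general is an assumption to confirm with the cluster administrators:

  ```bash
  #SBATCH --gres=gpu:1    # request one GPU (assumption: GPUs are configured as a GRES here)
  ```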
- Make test_program.sh executable:

  ```
  chmod +x test_program.sh
  ```
- Submit your batch script to Slurm:

  ```
  sbatch test_program.sh
  ```
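  On success, sbatch prints the ID assigned to your job; the number below is illustrative:

  ```
  Submitted batch job 12345
  ```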
- You can view the status of the job with the following:

  ```
  squeue -u <username>
  ```

  Replace `<username>` with your username.
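  The output has one row per pending or running job; all values below are illustrative:

  ```
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345 slurm-ge  gpu_test  cs_user  R       0:42      1 node01
  ```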
Once your job is complete, a new file named gpu_test.log will appear in your project directory. Its contents should show TensorFlow identifying any available GPUs.
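If the job ran on a GPU node, gpu_test.log should contain the output of `date` followed by the script's print statement, along these lines (timestamp and device name illustrative):

```
Mon Jan  1 12:00:00 EST 2024
Default GPU Device: /device:GPU:0
```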
If you have any issues with the GPU cluster itself, please send an email to root@cs.odu.edu with any errors you have received so that we may resolve the issue. Please note that your code and the Python libraries used in your project are not managed by us. While we may be able to point you in the right direction for some of those issues, it is ultimately up to you to resolve issues pertaining to your project.