Slurm


Slurm is an open-source job scheduler for Linux and Unix-like operating systems.

Quick Start Guide

Follow these steps to get started with the Computer Science Dept. HPC cluster and submit your first job.

  1. Log in to slurm-manager
    ssh <username>@slurm-manager.cs.odu.edu

    💡 replace <username> with your CS username
    💡 to gain access to the HPC cluster, email root@cs.odu.edu 📧
    💥 if not within the CS network, you must be connected to the CS VPN

  2. Load Anaconda into your current environment
    module use /mnt/lmod_modules/Linux/
    module load miniconda3
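
    💡 to confirm the module loaded, you can list your currently loaded modules:
    module list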

  3. Use Anaconda to manage your virtual environments
    conda create -p /mnt/hpc_projects/<username>/.envs/quick_start numpy==1.23.4 tensorflow-gpu

    💡 replace <username> with your CS username, e.g. conda create -p /mnt/hpc_projects/cs_username001/.envs/quick_start numpy==1.23.4 tensorflow-gpu

    💡 to get familiar with Anaconda, see the official conda documentation
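    💡 to use the environment interactively, you can activate it by its full path (assuming conda activate is available in your shell after loading the miniconda3 module above):
    conda activate /mnt/hpc_projects/<username>/.envs/quick_start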

  4. Create directory for your project
    mkdir <name>
    cd <name>

    💡 replace <name> with the name you want the folder to have

  5. Create and edit gpu_test.py

    import tensorflow as tf
    
    # report the default GPU device if TensorFlow can see one
    if tf.test.gpu_device_name():
        print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
    else:
        print("Please install GPU version of TF")
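
    💡 tf.test.gpu_device_name() is deprecated in newer TensorFlow releases; an equivalent check on TensorFlow 2.x (a minimal sketch) is:

    import tensorflow as tf

    # list every GPU device TensorFlow can see
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        print('GPUs visible to TensorFlow: {}'.format(gpus))
    else:
        print("Please install GPU version of TF")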
    
  6. Create and edit test_program.sh

    #!/bin/bash -l
    #SBATCH --job-name=gpu_test                # job name
    #SBATCH --partition=slurm-general          # partition to run the job on; if omitted, the default partition (slurm-general) is used
    #SBATCH --account=slurmgeneral             # only applicable if user is assigned multiple accounts
    #SBATCH --ntasks=1                         # number of tasks to run in parallel
    #SBATCH --time=1:00:00                     # time limit on the job (HH:MM:SS)
    #SBATCH --mem=1gb                          # request 1gb of memory
    #SBATCH --output=gpu_test.log              # output and error log
    
    date
    python3 gpu_test.py
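
    💡 if the default python3 on the compute node cannot find TensorFlow, you may also need to load the module and activate the environment from step 3 inside the script, just before the python3 line (a sketch; adjust the environment path to your own):

    module use /mnt/lmod_modules/Linux/
    module load miniconda3
    conda activate /mnt/hpc_projects/<username>/.envs/quick_start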
    
  7. Make test_program.sh executable
    chmod +x test_program.sh

  8. Submit your batch script to Slurm
    sbatch test_program.sh
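
    💡 on success, sbatch prints the ID of the new job, for example:
    Submitted batch job 12345
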
  9. View the status of your job with the following command
    squeue -u <username>

    💡 replace <username> with your username
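    💡 other standard Slurm commands that are often useful:
    scancel <jobid>        # cancel a queued or running job
    sacct -j <jobid>       # show status/accounting info for a job (requires job accounting to be enabled)
    sinfo                  # list partitions and node states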

Once your job completes, a new file named gpu_test.log will appear in your project directory. Its contents show whether TensorFlow identified any available GPUs.
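
For example:

    cat gpu_test.log

If a GPU was detected, the log will typically contain a line such as Default GPU Device: /device:GPU:0; the exact device string depends on the node your job ran on.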
If you have any issues with the GPU cluster itself, please email root@cs.odu.edu with any errors you have received so that we can resolve the issue. Please note that your code and the Python libraries used in your project are not managed by us; while we may be able to point you in the right direction for some of those issues, it is ultimately up to you to resolve issues pertaining to your project.

