Quick Start
Application
Please refer to the NYU HPC pages for details.
Useful commands
| command | explanation |
| --- | --- |
| `myquota` | Check your current storage quota usage (space and file counts) on each filesystem |
Storage Usage
| Environment Variable | Location | Suggested Usage | Initial Quota (Space / Files) |
| --- | --- | --- | --- |
| `$HOME` | /home/$USER/ | Personal home space; best kept for small files, or avoided altogether | 50.0 GB / 30.0 K |
| `$SCRATCH` | /scratch/$USER/ | Best for large files | 5.0 TB / 1.0 M |
| `$VAST` | /vast/$USER/ | Flash storage for high-I/O workflows | 2.0 TB / 5.0 M |
| `$ARCHIVE` | /archive/$USER/ | Long-term storage | 2.0 TB / 20.0 K |
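A quick way to see where these locations resolve and how much of each quota you are using (a minimal sketch; the project folder name is just a placeholder):

```bash
# resolve the environment variables to actual paths
echo "HOME=$HOME  SCRATCH=$SCRATCH  VAST=$VAST  ARCHIVE=$ARCHIVE"

# check current usage against the quotas listed above
myquota

# keep code and large data out of $HOME, e.g. stage a project under $SCRATCH
mkdir -p $SCRATCH/projects/my-project   # "my-project" is a placeholder
```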
Datasets
For better I/O performance, it is recommended to use the dataset copies stored under /vast.
| Dataset Name | Folder | Comments |
| --- | --- | --- |
| ImageNet | /scratch/work/public/ml-datasets/imagenet<br>/vast/work/public/ml-datasets/imagenet | Need to apply for access |
| MS-COCO | /scratch/work/public/ml-datasets/coco/coco-2014.sqf<br>/scratch/work/public/ml-datasets/coco/coco-2015.sqf<br>/scratch/work/public/ml-datasets/coco/coco-2017.sqf<br>/vast/work/public/ml-datasets/coco/coco-2014.sqf<br>/vast/work/public/ml-datasets/coco/coco-2015.sqf<br>/vast/work/public/ml-datasets/coco/coco-2017.sqf | |
Access datasets with Singularity
```bash
singularity exec \
    --overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
    /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif \
    /bin/bash
```
```bash
singularity shell \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
After running this command, the ImageNet dataset is available in the `/imagenet/` folder at the container's root directory. You can create symbolic links to "move" the dataset to wherever you would like, for example:

```bash
ln -s /imagenet/train/ $VAST/codes/A2MIM/data/ImageNet/
ln -s /imagenet/val/   $VAST/codes/A2MIM/data/ImageNet/
ln -s /imagenet/test/  $VAST/codes/A2MIM/data/ImageNet/
```
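A quick sanity check (a sketch) that the overlays are mounted and the links resolve, run from inside the container:

```bash
# /imagenet only exists inside the container with the overlays mounted
ls /imagenet/
# the symbolic links created above should now point at the mounted data
ls -l $VAST/codes/A2MIM/data/ImageNet/
```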
Best Practice
Submit job with SLURM
Check how many resources you can access
```bash
sacctmgr list qos format=maxwall,maxtresperuser%40,name
```

You will see output like the following:

```
    MaxWall                                MaxTRESPU       Name
----------- ---------------------------------------- ----------
                                                         normal
 2-00:00:00                       cpu=3000,mem=6000G      cpu48
 7-00:00:00                       cpu=1000,mem=2000G     cpu168
 2-00:00:00                              gres/gpu=48      gpu48
 7-00:00:00                               gres/gpu=4     gpu168
   04:00:00                        cpu=48,gres/gpu=4   interact
                                         gres/gpu=128    gpuplus
                                          gres/gpu=24        cds
                                             cpu=5000    cpuplus
                                         gres/gpu=240     gpuamd
   12:00:00                                cpu=30000     cpulow
```
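To see which of these QoS levels your own account is actually allowed to use, you can also query your associations (a sketch; adjust the format fields to the columns you care about):

```bash
# list the account / partition / QoS combinations attached to your user
sacctmgr show assoc user=$USER format=account%20,user%15,partition%15,qos%40
```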
srun
- Interactive session
```bash
srun \
    --cpus-per-task=2 \
    --gres=gpu:0 \
    --mem=16GB \
    --time=12:00:00 \
    --job-name=torch-test \
    --pty \
    /bin/bash
```
- Submit a job directly
```bash
srun \
    --cpus-per-task=2 \
    --gres=gpu:2 \
    --mem=12GB \
    --time=18:00:00 \
    --job-name=torch-test \
    python train.py
```
sbatch
```bash
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:rtx8000:4
#SBATCH --mem=32GB
#SBATCH --time=24:00:00
#SBATCH --job-name=your_job_name
#SBATCH --output=./slurm_logs/slurm-%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@email.com

module purge

echo "SLURM_JOBID: " ${SLURM_JOBID}

WORK_DIR="/fill/your/work/dir/here"
PYTHON_SCRIPT="${WORK_DIR}/main.py"
TORCHRUN_SCRIPT="${WORK_DIR}/scripts/torchrun.sh"

DATASET_NAME="imagenet"
TOTAL_BATCH_SIZE=512
TOTAL_EPOCHS=100

export PYTHON_LAUNCH_CMD="${PYTHON_SCRIPT} \
    --dataset_name ${DATASET_NAME} \
    --bs ${TOTAL_BATCH_SIZE} \
    --ep ${TOTAL_EPOCHS}"

srun --jobid ${SLURM_JOBID} sh ${TORCHRUN_SCRIPT}
```
```bash
#!/bin/bash

export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1

# cmd cfgs
ENV_LAUNCH_CMD="source $VAST/miniconda3/bin/activate your_env"

# distributed cfgs
NNODE=${1:-${SLURM_NNODES}}
GPUS_PER_NODE=${2:-${SLURM_GPUS_ON_NODE}}
WORLD_SIZE=$((${NNODE}*${GPUS_PER_NODE}))
echo "Running on ${NNODE} nodes with ${GPUS_PER_NODE} GPUs per node."
echo "${WORLD_SIZE} GPUs in total."

NODELIST=$(scontrol show hostname ${SLURM_JOB_NODELIST})
echo "SLURM_NODELIST="${SLURM_NODELIST}
MASTER_ADDR=$(head -n 1 <<< "${NODELIST}")
MASTER_PORT=10033
echo "MASTER_ADDR:MASTER_PORT=${MASTER_ADDR}:${MASTER_PORT}"

# launch cmds
TORCHRUN_LAUNCHER="torchrun \
    --nnodes ${NNODE} \
    --nproc_per_node ${GPUS_PER_NODE} \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    --node_rank ${SLURM_PROCID} \
    ${PYTHON_LAUNCH_CMD}"
echo ${TORCHRUN_LAUNCHER}

${ENV_LAUNCH_CMD}
${TORCHRUN_LAUNCHER}

# Alternatively, run the same launcher inside a Singularity container
# with the dataset overlays mounted:
# singularity exec \
#     --nv \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
#     /vast/work/public/singularity/cuda11.3.0-cudnn8-devel-ubuntu20.04.sif \
#     /bin/bash -c "${ENV_LAUNCH_CMD}; ${TORCHRUN_LAUNCHER}"
```
- To request a specific GPU type, modify `#SBATCH --gres=gpu:2` as below (to check which GPU types are available, see the `sinfo` sketch after this list):

```bash
#SBATCH --gres=gpu:rtx8000:2
# or
#SBATCH --gres=gpu:v100:2
# or
#SBATCH --gres=gpu:a100:2
```
- To specify the email notification type, modify `#SBATCH --mail-type` as below:

```bash
#SBATCH --mail-type=ALL    # choose from [BEGIN, END, FAIL, ALL]
```
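To check which GPU types the cluster exposes (referenced in the GPU-type bullet above), `sinfo` can print the generic resources per partition; a sketch using standard `sinfo` format specifiers:

```bash
# %P = partition name, %G = generic resources (GPU types and counts)
sinfo -o "%P %G" | sort -u
```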
Useful SLURM commands

| command | explanation | Other Args |
| --- | --- | --- |
| `srun --gres=gpu[:v100][:2]` | Run a command (or an interactive session with `--pty /bin/bash`) with the requested resources | `-J JobName` |
| `sbatch [YourFile].SBATCH` | Submit a batch job script | |
| `scancel [JobID]` | Cancel a pending or running job | |
| `squeue -u $USER` | List your pending and running jobs | `-j <JobID>` |
| `sinfo` | Show the state of partitions and nodes | |
| `seff [JobID]` | Report the CPU and memory efficiency of a completed job | |
| `scontrol show job [JobID]` | Show detailed information about a job | |
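A typical submit-and-monitor loop with the commands above (`train.SBATCH` and the job ID are placeholders):

```bash
sbatch train.SBATCH            # prints e.g. "Submitted batch job 12345678"
squeue -u $USER                # is the job pending (PD) or running (R)?
scontrol show job 12345678     # full details: nodes, GRES, time limit, work dir
seff 12345678                  # CPU / memory efficiency once the job has finished
scancel 12345678               # cancel the job if something went wrong
```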
Debug
pdb
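A minimal way to use pdb on the cluster is to launch your script under the debugger from an interactive srun session (a sketch; `train.py` stands in for your own entry point):

```bash
# request an interactive shell first (see the srun section), then:
python -m pdb train.py
# common pdb commands once inside the debugger:
#   b train.py:42   set a breakpoint at line 42
#   n / s           next line / step into a call
#   p some_var      print a variable
#   c / q           continue / quit
```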
VScode
SSH configuration
```
Host NYU-greene-vpn
    HostName greene.hpc.nyu.edu
    User [UserName]
```
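With this block in your local `~/.ssh/config`, both plain SSH and the VS Code Remote-SSH extension can connect via the alias (a small sketch; as the alias name suggests, the NYU VPN is assumed):

```bash
# run on your local machine; opens a login shell on Greene using the alias above
ssh NYU-greene-vpn
```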
Change the default .vscode-server location
```bash
# create a new folder to actually hold .vscode-server
mkdir $VAST/.vscode-server
# if ~/.vscode-server already exists, move its contents into the new folder and delete it first
# create a soft link pointing from the default location to our new location
ln -s $VAST/.vscode-server/ ~/.vscode-server
```
Jupyter Lab / Notebook
```bash
jupyter lab --no-browser --ip=0.0.0.0
```
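Jupyter runs on a Greene node, so its port has to be forwarded to your laptop before you can open it in a local browser. A sketch, assuming the default port 8888 and using `[NodeName]` as a placeholder for the node shown in Jupyter's startup log:

```bash
# run on your local machine; reuses the Host alias from the SSH configuration above
ssh -L 8888:[NodeName]:8888 NYU-greene-vpn
# then open http://localhost:8888 and paste the token / URL printed by Jupyter
```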
Others
Multiple steps with srun
- Launch a Jupyter service
```bash
srun --gres=gpu:2 jupyter lab --no-browser --ip=0.0.0.0
```
- Forward the Jupyter port to your local machine, e.g. with an SSH tunnel like the sketch in the Jupyter Lab section above
- Mount datasets using Singularity
```bash
singularity shell \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
- Run Python script
```bash
python train.py --devices 0 1
```
Single step with sbatch
```bash
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --mem=12GB
#SBATCH --time=12:00:00
#SBATCH --job-name=train
#SBATCH --mail-type=END
#SBATCH --mail-user=[NetID]@nyu.edu

module purge

singularity exec \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/[choose/your/image].sif \
    /bin/bash run.sh
```
```bash
#!/bin/bash
source [/path/to/conda]/bin/activate [EnvName]
python train.py --devices 0 1
```
Python Environment Management with Singularity
| Images Folder | Example | Explanation |
| --- | --- | --- |
| /scratch/work/public/overlay-fs-ext3<br>/vast/work/public/overlay-fs-ext3 | overlay-50G-10M.ext3.gz | Writable overlay with 50 GB of free space that can hold up to 10 M files |
| /scratch/work/public/singularity/<br>/vast/work/public/singularity/ | cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif | Read-only base images (OS + CUDA/cuDNN) |
Create your own overlay image
- Choose a desired overlay image, copy it to your own folder, and decompress it, for example:
```bash
mkdir $SCRATCH/singularity-images/
cd $SCRATCH/singularity-images/
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz
```
- Launch a Singularity container from the existing image
```bash
singularity shell \
    --overlay $SCRATCH/singularity-images/overlay-50G-10M.ext3:rw \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
After running this command, there should be an `/ext3/` folder in the root directory, which is only visible from inside the Singularity container. Put everything you want to write into the image under `/ext3/`, including the Miniconda environment. You can run `nvcc -V` to check that the image is loaded correctly; you should see output like the following, which only appears inside the Singularity container.

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
```
- Install Miniconda and Python
```bash
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
```
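With Miniconda installed under `/ext3`, create your Python environment inside the writable overlay. A sketch; the environment name, Python version, and packages are placeholders, so pick versions that match the CUDA toolkit of your base image:

```bash
# inside the container, with the overlay still mounted :rw
source /ext3/miniconda3/etc/profile.d/conda.sh
conda create -n your_env python=3.8 -y
conda activate your_env
pip install torch torchvision    # choose builds matching the image's CUDA toolkit
```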
- Create a bash script for activating Conda (save it as e.g. `/ext3/env.sh`, which the wrapper scripts below source)
```bash
#!/bin/bash
source /ext3/miniconda3/etc/profile.d/conda.sh
export PATH=/ext3/miniconda3/bin:$PATH
export PYTHONPATH=/ext3/miniconda3/bin:$PATH
```
- Exit the Singularity container and rename the image to whatever you want, for example:

```bash
exit    # or press Ctrl+D
mv overlay-50G-10M.ext3 cp38-torch1.13-cu117.ext3
```
Run Python Script with self-created Singularity image
Example (project-specific) training commands to run once the environment is active:

```bash
PORT=29500 bash tools/dist_train.sh configs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300.py 2 > debug.log 2>&1
python main.py --fname configs/in1k_vith14_ep300_debug.yaml --devices cuda:0 cuda:1 | tee debug.log
```
To work inside your image interactively, open a shell in the container with the overlay mounted read-only:

```bash
singularity exec \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif \
    /bin/bash
```

To open the base image without any overlay:

```bash
singularity shell /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
To request resources and launch the container in a single srun command, mounting both your environment overlay and the dataset overlays:

```bash
srun \
    --cpus-per-task=2 \
    --gres=gpu:2 \
    --mem=12GB \
    --time=18:00:00 \
    --job-name=torch-test \
    singularity exec \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
    /bin/bash
```
A wrapper script that activates the `/ext3` environment inside the container:

```bash
#!/bin/bash

singularity exec \
    --nv \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif \
    /bin/bash -c "source /ext3/env.sh"
    # append your actual command after env.sh, e.g.:
    # /bin/bash -c "source /ext3/env.sh; python train.py"
```
References
- Getting Started
- Singularity
- Singularity & Miniconda Python environment
- Singularity & Datasets
- Princeton tutorial
- Conda
- Datasets
- SLURM