NYU Greene HPC

Last updated February 5, 2024

Quick Start

Application

Please refer to the official NYU HPC pages for details on applying for access.

Useful commands

| command | explanation |
| --- | --- |
| myquota | Check your current storage quota and usage. |

Storage Usage

| Environment Variable | Location | Suggested Usage | Initial Quota (Space/Files) |
| --- | --- | --- | --- |
| $HOME | /home/$USER/ | Personal home space; best for small files only, or avoid using it for data at all | 50.0GB / 30.0K |
| $SCRATCH | /scratch/$USER/ | Best for large files | 5.0TB / 1.0M |
| $VAST | /vast/$USER/ | Flash storage for high-I/O workflows | 2.0TB / 5.0M |
| $ARCHIVE | /archive/$USER/ | Long-term storage | 2.0TB / 20.0K |
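A layout that respects these quotas is to keep code, environments, and data on /scratch or /vast and leave /home nearly empty; a minimal sketch (the project name my-project is a placeholder):

# keep projects on scratch so $HOME stays small
mkdir -p $SCRATCH/projects/my-project
cd $SCRATCH/projects/my-project

# optional: a convenience symlink from home into the scratch project
ln -s $SCRATCH/projects/my-project ~/my-project

# check quotas after moving data around
myquota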
 

Datasets

💡 For better I/O performance, it is recommended to use the dataset copies stored under /vast.

  • ImageNet
    • /scratch/work/public/ml-datasets/imagenet
    • /vast/work/public/ml-datasets/imagenet
  • MS-COCO
    • /scratch/work/public/ml-datasets/coco/coco-2014.sqf
    • /scratch/work/public/ml-datasets/coco/coco-2015.sqf
    • /scratch/work/public/ml-datasets/coco/coco-2017.sqf
    • /vast/work/public/ml-datasets/coco/coco-2014.sqf
    • /vast/work/public/ml-datasets/coco/coco-2015.sqf
    • /vast/work/public/ml-datasets/coco/coco-2017.sqf

Access datasets with Singularity

singularity exec \
    --overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
    /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif \
    /bin/bash

singularity shell \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
After running this command, you will find the ImageNet dataset in the /imagenet/ folder at the root of the container's filesystem. You can create symbolic links to “move” the dataset to wherever you would like, for example:
ln -s /imagenet/train/ $VAST/codes/A2MIM/data/ImageNet/
ln -s /imagenet/val/ $VAST/codes/A2MIM/data/ImageNet/
ln -s /imagenet/test/ $VAST/codes/A2MIM/data/ImageNet/
 

Best Practice

Submit job with SLURM

Check how many resources you can access

sacctmgr list qos format=maxwall,maxtresperuser%40,name
You will see output like the following:
MaxWall     MaxTRESPU                                Name
----------- ---------------------------------------- ----------
                                                     normal
 2-00:00:00 cpu=3000,mem=6000G                       cpu48
 7-00:00:00 cpu=1000,mem=2000G                       cpu168
 2-00:00:00 gres/gpu=48                              gpu48
 7-00:00:00 gres/gpu=4                               gpu168
   04:00:00 cpu=48,gres/gpu=4                        interact
            gres/gpu=128                             gpuplus
            gres/gpu=24                              cds
            cpu=5000                                 cpuplus
            gres/gpu=240                             gpuamd
   12:00:00 cpu=30000                                cpulow
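To see how these QoS limits map onto actual partitions, nodes, and GPU types, a quick sinfo query helps; the format string below is just one reasonable choice:

# list partitions with their time limits and the GPU resources they expose
sinfo -o "%P %l %G"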

srun

  • Interactive session
    srun \
        --cpus-per-task=2 \
        --gres=gpu:0 \
        --mem=16GB \
        --time=12:00:00 \
        --job-name=torch-test \
        --pty \
        /bin/bash
  • Submit a job directly
    srun \
        --cpus-per-task=2 \
        --gres=gpu:2 \
        --mem=12GB \
        --time=18:00:00 \
        --job-name=torch-test \
        python train.py

sbatch

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:rtx8000:4
#SBATCH --mem=32GB
#SBATCH --time=24:00:00
#SBATCH --job-name=your_job_name
#SBATCH --output=./slurm_logs/slurm-%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@email.com

module purge

echo "SLURM_JOBID: " ${SLURM_JOBID}

WORK_DIR="/fill/your/work/dir/here"
PYTHON_SCRIPT="${WORK_DIR}/main.py"
TORCHRUN_SCRIPT="${WORK_DIR}/scripts/torchrun.sh"

DATASET_NAME="imagenet"
TOTAL_BATCH_SIZE=512
TOTAL_EPOCHS=100

export PYTHON_LAUNCH_CMD="${PYTHON_SCRIPT} \
    --dataset_name ${DATASET_NAME} \
    --bs ${TOTAL_BATCH_SIZE} \
    --ep ${TOTAL_EPOCHS}"

srun --jobid ${SLURM_JOBID} sh ${TORCHRUN_SCRIPT}
${WORK_DIR}/scripts/multi-nodes.sbatch
#!/bin/bash

export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1

# cmd cfgs
ENV_LAUNCH_CMD="source $VAST/miniconda3/bin/activate your_env"

# distributed cfgs
NNODE=${1:-${SLURM_NNODES}}
GPUS_PER_NODE=${2:-${SLURM_GPUS_ON_NODE}}
WORLD_SIZE=$((${NNODE}*${GPUS_PER_NODE}))
echo "Running on ${NNODE} nodes with ${GPUS_PER_NODE} GPUs per node."
echo "${WORLD_SIZE} GPUs in total."

NODELIST=$(scontrol show hostname ${SLURM_JOB_NODELIST})
echo "SLURM_NODELIST="${SLURM_NODELIST}
MASTER_ADDR=$(head -n 1 <<< "${NODELIST}")
MASTER_PORT=10033
echo "MASTER_ADDR:MASTER_PORT=${MASTER_ADDR}:${MASTER_PORT}"

# launch cmds
TORCHRUN_LAUNCHER="torchrun \
    --nnodes ${NNODE} \
    --nproc_per_node ${GPUS_PER_NODE} \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    --node_rank ${SLURM_PROCID} \
    ${PYTHON_LAUNCH_CMD}"
echo ${TORCHRUN_LAUNCHER}

${ENV_LAUNCH_CMD}
${TORCHRUN_LAUNCHER}

# Alternative: run the same launcher inside a Singularity container with the
# ImageNet overlays mounted (note that --overlay flags must come before the .sif image).
# singularity exec \
#     --nv \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
#     /vast/work/public/singularity/cuda11.3.0-cudnn8-devel-ubuntu20.04.sif \
#     /bin/bash -c "${ENV_LAUNCH_CMD}; ${TORCHRUN_LAUNCHER}"
${WORK_DIR}/scripts/torchrun.sh
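To launch the multi-node job, submit the sbatch file above and monitor the queue; a minimal sketch (replace the path with wherever you saved multi-nodes.sbatch):

# submit the multi-node job
sbatch /fill/your/work/dir/here/scripts/multi-nodes.sbatch

# confirm it is queued / running
squeue -u $USER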
  • To request a specific GPU type, modify #SBATCH --gres=gpu:2 as below:
    • #SBATCH --gres=gpu:rtx8000:2 or #SBATCH --gres=gpu:v100:2 or #SBATCH --gres=gpu:a100:2
  • To change when email notifications are sent, modify #SBATCH --mail-type as below:
    • #SBATCH --mail-type=ALL # choose from [BEGIN, END, FAIL, ALL]
 

Useful commands about SLURM

| command | explanation | other args |
| --- | --- | --- |
| srun --gres=gpu[:v100][:2] | Run a command or interactive job step with the requested resources | -J JobName |
| sbatch [YourFile].SBATCH | Submit a batch job script | |
| scancel [JobID] | Cancel a pending or running job | |
| squeue -u $USER | List your pending and running jobs | -j <JobID> |
| sinfo | Show the status of partitions and nodes | |
| seff [JobID] | Report the resource-usage efficiency of a completed job | |
| scontrol show job [JobID] | Show detailed information about a job | |
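A typical monitoring loop with the commands above; the script name train.SBATCH and the job ID 12345678 are placeholders:

# submit and note the job ID that sbatch prints
sbatch train.SBATCH

# watch the queue until the job starts
squeue -u $USER

# after it finishes, check how well the requested resources were used
seff 12345678
scontrol show job 12345678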

Debug

pdb
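For quick interactive debugging, a minimal sketch (train.py is a placeholder; request an interactive session with srun --pty first, as shown in the srun section above):

# inside an interactive session, run the script under the debugger
python -m pdb train.py

# or insert a breakpoint directly in the code and run it normally:
# breakpoint()   (built into Python 3.7+)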

 

VScode (see the dedicated section below)

 

VScode

SSH configuration

Host NYU-greene-vpn
    HostName greene.hpc.nyu.edu
    User [UserName]
~/.ssh/config
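With this entry in place, connecting is just the host alias (assuming you are on the NYU network or VPN, as the alias name suggests):

ssh NYU-greene-vpn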

Change the default .vscode-server location

# create a new folder to actually hold .vscode-server
mkdir $VAST/.vscode-server

# create a soft link pointing from the default location to the new location
# (remove or move any existing ~/.vscode-server first, otherwise the link is created inside it)
ln -s $VAST/.vscode-server/ ~/.vscode-server
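A quick check that the relocation worked:

# should show a symlink pointing to $VAST/.vscode-server/
ls -ld ~/.vscode-server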
 

Jupyter Lab / Notebook

jupyter lab --no-browser --ip=0.0.0.0
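Jupyter itself runs on the cluster, so you still need an SSH tunnel from your laptop to reach it. A minimal sketch, assuming Jupyter reports port 8888 and runs on a compute node named gr001 (both are placeholders; substitute the host and port that Jupyter actually prints):

# run on your local machine
ssh -L 8888:gr001:8888 [UserName]@greene.hpc.nyu.edu

# then open http://localhost:8888 in your local browser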
 

Others

Multiple steps with srun

  1. Launch a Jupyter service
    1. srun --gres=gpu:2 jupyter lab --no-browser --ip=0.0.0.0
  1. Forward the Jupyter service to local
  1. Mount datasets using Singularity
    1. singularity shell \ --nv \ --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \ --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \ --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \ /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
  1. Run Python script
    1. python train.py --devices 0 1

Single step with sbatch

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --mem=12GB
#SBATCH --time=12:00:00
#SBATCH --job-name=train
#SBATCH --mail-type=END
#SBATCH --mail-user=[NetID]@nyu.edu

module purge

singularity exec \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/[choose/your/image].sif \
    /bin/bash run.sh
#!/bin/bash
source [/path/to/conda]/bin/activate [EnvName]
python train.py --devices 0 1
run.sh

Python Environment Management with Singularity

| Images Folder | Example | Explanation |
| --- | --- | --- |
| /scratch/work/public/overlay-fs-ext3, /vast/work/public/overlay-fs-ext3 | overlay-50G-10M.ext3.gz | Writable overlay with 50GB of free space and room for 10M files |
| /scratch/work/public/singularity/, /vast/work/public/singularity/ | cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif | Read-only base container images (OS plus CUDA/cuDNN toolchain) |

Create your own overlay image

  1. Choose a desired overlay image, copy it to your own folder, and decompress it, for example:
    mkdir $SCRATCH/singularity-images/
    cd $SCRATCH/singularity-images/
    cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
    gunzip overlay-50G-10M.ext3.gz
  2. Launch a Singularity container with the overlay mounted read-write:
    singularity shell \
        --overlay $SCRATCH/singularity-images/overlay-50G-10M.ext3:rw \
        /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
    After running this command, there should be a /ext3/ folder in the root directory, which is only visible inside the Singularity container. Put everything you want written into the image under /ext3/, including the Miniconda environment.
    You can run nvcc -V to check that the image is loaded correctly. You should see output like the following, which only appears inside the Singularity container:
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Jun__8_16:49:14_PDT_2022
    Cuda compilation tools, release 11.7, V11.7.99
    Build cuda_11.7.r11.7/compiler.31442593_0
  3. Install Miniconda and Python:
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
  4. Create a bash script for activating Conda (the later examples assume it is saved as /ext3/env.sh):
    #!/bin/bash
    source /ext3/miniconda3/etc/profile.d/conda.sh
    export PATH=/ext3/miniconda3/bin:$PATH
    export PYTHONPATH=/ext3/miniconda3/bin:$PATH
  5. Exit the Singularity container and rename the image to whatever you want, for example:
    exit  # or press Ctrl+D
    mv overlay-50G-10M.ext3 cp38-torch1.13-cu117.ext3
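With Miniconda inside the overlay, install your packages from within the writable container (step 2 above, with the overlay mounted :rw). A minimal sketch, assuming the activation script from step 4 was saved as /ext3/env.sh and leaving exact package versions to your project:

# inside: singularity shell --overlay <your-overlay>.ext3:rw <base-image>.sif
source /ext3/env.sh
pip install torch torchvision   # plus whatever else your project needs
# cuda.is_available() is only True on a GPU node with --nv
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"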

Run Python Script with self-created Singularity image

Open a shell in the base image only:

singularity shell /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif

Open a shell with your own environment overlay mounted read-only:

singularity exec \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif \
    /bin/bash

Request resources with srun and attach both the environment overlay and the dataset overlays:

srun \
    --cpus-per-task=2 \
    --gres=gpu:2 \
    --mem=12GB \
    --time=18:00:00 \
    --job-name=torch-test \
    singularity exec \
        --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
        --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
        --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
        --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
        /vast/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
        /bin/bash

A wrapper script that activates the Conda environment inside the container:

#!/bin/bash
singularity exec \
    --nv \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif \
    /bin/bash -c "source /ext3/env.sh"

Inside the container, launch training as usual, for example:

PORT=29500 bash tools/dist_train.sh configs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300.py 2 > debug.log 2>&1
python main.py --fname configs/in1k_vith14_ep300_debug.yaml --devices cuda:0 cuda:1 | tee debug.log

References