Quick Start
Application
Please refer to the NYU HPC pages for details.
Useful commands
| command | explanation |
| --- | --- |
| `myquota` | Check your current storage quota usage (space and file counts) on each filesystem |
Storage Usage
| Environment Variable | Location | Suggested Usage | Initial Quota (Space / Files) |
| --- | --- | --- | --- |
| `$HOME` | /home/$USER/ | Personal home space; best kept for small files, or avoided altogether | 50.0 GB / 30.0 K |
| `$SCRATCH` | /scratch/$USER/ | Best for large files | 5.0 TB / 1.0 M |
| `$VAST` | /vast/$USER/ | Flash storage for high-I/O workflows | 2.0 TB / 5.0 M |
| `$ARCHIVE` | /archive/$USER/ | Long-term storage | 2.0 TB / 20.0 K |
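A quick way to see where these locations resolve and how much of each quota you are using (a minimal sketch; the project folder name is just a placeholder):

```bash
# resolve the environment variables to actual paths
echo "HOME=$HOME  SCRATCH=$SCRATCH  VAST=$VAST  ARCHIVE=$ARCHIVE"

# check current usage against the quotas listed above
myquota

# keep code and large data out of $HOME, e.g. stage a project under $SCRATCH
mkdir -p $SCRATCH/projects/my-project   # "my-project" is a placeholder
```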
Datasets
For better I/O performance, it is recommended to use the dataset copies stored under /vast.
| Dataset Name | Folder | Comments |
| --- | --- | --- |
| ImageNet | /scratch/work/public/ml-datasets/imagenet<br>/vast/work/public/ml-datasets/imagenet | Need to apply for access |
| MS-COCO | /scratch/work/public/ml-datasets/coco/coco-2014.sqf<br>/scratch/work/public/ml-datasets/coco/coco-2015.sqf<br>/scratch/work/public/ml-datasets/coco/coco-2017.sqf<br>/vast/work/public/ml-datasets/coco/coco-2014.sqf<br>/vast/work/public/ml-datasets/coco/coco-2015.sqf<br>/vast/work/public/ml-datasets/coco/coco-2017.sqf | |
Access datasets with Singularity
```bash
singularity exec \
    --overlay /<path>/pytorch1.8.0-cuda11.1.ext3:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2014.sqf:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2015.sqf:ro \
    --overlay /vast/work/public/ml-datasets/coco/coco-2017.sqf:ro \
    /scratch/work/public/singularity/cuda11.1-cudnn8-devel-ubuntu18.04.sif \
    /bin/bash
```
```bash
singularity shell \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
After running this command, the ImageNet dataset is available in the `/imagenet/` folder at the container's root directory. You can create symbolic links to "move" the dataset to wherever you would like, for example:

```bash
ln -s /imagenet/train/ $VAST/codes/A2MIM/data/ImageNet/
ln -s /imagenet/val/   $VAST/codes/A2MIM/data/ImageNet/
ln -s /imagenet/test/  $VAST/codes/A2MIM/data/ImageNet/
```
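A quick sanity check (a sketch) that the overlays are mounted and the links resolve, run from inside the container:

```bash
# /imagenet only exists inside the container with the overlays mounted
ls /imagenet/
# the symbolic links created above should now point at the mounted data
ls -l $VAST/codes/A2MIM/data/ImageNet/
```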
Best Practice
Submit job with SLURM
Check how many resources you can access
```bash
sacctmgr list qos format=maxwall,maxtresperuser%40,name
```

You will see output like the following:

```
    MaxWall                                MaxTRESPU       Name
----------- ---------------------------------------- ----------
                                                         normal
 2-00:00:00                       cpu=3000,mem=6000G      cpu48
 7-00:00:00                       cpu=1000,mem=2000G     cpu168
 2-00:00:00                              gres/gpu=48      gpu48
 7-00:00:00                               gres/gpu=4     gpu168
   04:00:00                        cpu=48,gres/gpu=4   interact
                                         gres/gpu=128    gpuplus
                                          gres/gpu=24        cds
                                             cpu=5000    cpuplus
                                         gres/gpu=240     gpuamd
   12:00:00                                cpu=30000     cpulow
```
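To see which of these QoS levels your own account is actually allowed to use, you can also query your associations (a sketch; adjust the format fields to the columns you care about):

```bash
# list the account / partition / QoS combinations attached to your user
sacctmgr show assoc user=$USER format=account%20,user%15,partition%15,qos%40
```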
srun
- Interactive session
```bash
srun \
    --cpus-per-task=2 \
    --gres=gpu:0 \
    --mem=16GB \
    --time=12:00:00 \
    --job-name=torch-test \
    --pty \
    /bin/bash
```
- Submit a job directly
```bash
srun \
    --cpus-per-task=2 \
    --gres=gpu:2 \
    --mem=12GB \
    --time=18:00:00 \
    --job-name=torch-test \
    python train.py
```
sbatch
```bash
#!/bin/bash

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:rtx8000:4
#SBATCH --mem=32GB
#SBATCH --time=24:00:00
#SBATCH --job-name=your_job_name
#SBATCH --output=./slurm_logs/slurm-%j.out
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your@email.com

module purge

echo "SLURM_JOBID: " ${SLURM_JOBID}

WORK_DIR="/fill/your/work/dir/here"
PYTHON_SCRIPT="${WORK_DIR}/main.py"
TORCHRUN_SCRIPT="${WORK_DIR}/scripts/torchrun.sh"

DATASET_NAME="imagenet"
TOTAL_BATCH_SIZE=512
TOTAL_EPOCHS=100

export PYTHON_LAUNCH_CMD="${PYTHON_SCRIPT} \
    --dataset_name ${DATASET_NAME} \
    --bs ${TOTAL_BATCH_SIZE} \
    --ep ${TOTAL_EPOCHS}"

srun --jobid ${SLURM_JOBID} sh ${TORCHRUN_SCRIPT}
```
```bash
#!/bin/bash

export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=1

# cmd cfgs
ENV_LAUNCH_CMD="source $VAST/miniconda3/bin/activate your_env"

# distributed cfgs
NNODE=${1:-${SLURM_NNODES}}
GPUS_PER_NODE=${2:-${SLURM_GPUS_ON_NODE}}
WORLD_SIZE=$((${NNODE}*${GPUS_PER_NODE}))
echo "Running on ${NNODE} nodes with ${GPUS_PER_NODE} GPUs per node."
echo "${WORLD_SIZE} GPUs in total."

NODELIST=$(scontrol show hostname ${SLURM_JOB_NODELIST})
echo "SLURM_NODELIST="${SLURM_NODELIST}
MASTER_ADDR=$(head -n 1 <<< "${NODELIST}")
MASTER_PORT=10033
echo "MASTER_ADDR:MASTER_PORT=${MASTER_ADDR}:${MASTER_PORT}"

# launch cmds
TORCHRUN_LAUNCHER="torchrun \
    --nnodes ${NNODE} \
    --nproc_per_node ${GPUS_PER_NODE} \
    --master_addr ${MASTER_ADDR} \
    --master_port ${MASTER_PORT} \
    --node_rank ${SLURM_PROCID} \
    ${PYTHON_LAUNCH_CMD}"
echo ${TORCHRUN_LAUNCHER}

${ENV_LAUNCH_CMD}
${TORCHRUN_LAUNCHER}

# Alternatively, run the same launcher inside a Singularity container
# with the dataset overlays mounted:
# singularity exec \
#     --nv \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
#     --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
#     /vast/work/public/singularity/cuda11.3.0-cudnn8-devel-ubuntu20.04.sif \
#     /bin/bash -c "${ENV_LAUNCH_CMD}; ${TORCHRUN_LAUNCHER}"
```
- To request a specific GPU type, modify `#SBATCH --gres=gpu:2` as below (to check which GPU types are available, see the `sinfo` sketch after this list):

```bash
#SBATCH --gres=gpu:rtx8000:2
# or
#SBATCH --gres=gpu:v100:2
# or
#SBATCH --gres=gpu:a100:2
```
- To specify the email notification type, modify `#SBATCH --mail-type` as below:

```bash
#SBATCH --mail-type=ALL    # choose from [BEGIN, END, FAIL, ALL]
```
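To check which GPU types the cluster exposes (referenced in the GPU-type bullet above), `sinfo` can print the generic resources per partition; a sketch using standard `sinfo` format specifiers:

```bash
# %P = partition name, %G = generic resources (GPU types and counts)
sinfo -o "%P %G" | sort -u
```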
Useful SLURM commands

| command | explanation | Other Args |
| --- | --- | --- |
| `srun --gres=gpu[:v100][:2]` | Run a command (or an interactive session with `--pty /bin/bash`) with the requested resources | `-J JobName` |
| `sbatch [YourFile].SBATCH` | Submit a batch job script | |
| `scancel [JobID]` | Cancel a pending or running job | |
| `squeue -u $USER` | List your pending and running jobs | `-j <JobID>` |
| `sinfo` | Show the state of partitions and nodes | |
| `seff [JobID]` | Report the CPU and memory efficiency of a completed job | |
| `scontrol show job [JobID]` | Show detailed information about a job | |
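A typical submit-and-monitor loop with the commands above (`train.SBATCH` and the job ID are placeholders):

```bash
sbatch train.SBATCH            # prints e.g. "Submitted batch job 12345678"
squeue -u $USER                # is the job pending (PD) or running (R)?
scontrol show job 12345678     # full details: nodes, GRES, time limit, work dir
seff 12345678                  # CPU / memory efficiency once the job has finished
scancel 12345678               # cancel the job if something went wrong
```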
Debug
pdb
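A minimal way to use pdb on the cluster is to launch your script under the debugger from an interactive srun session (a sketch; `train.py` stands in for your own entry point):

```bash
# request an interactive shell first (see the srun section), then:
python -m pdb train.py
# common pdb commands once inside the debugger:
#   b train.py:42   set a breakpoint at line 42
#   n / s           next line / step into a call
#   p some_var      print a variable
#   c / q           continue / quit
```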
VScode
SSH configuration
```
Host NYU-greene-vpn
    HostName greene.hpc.nyu.edu
    User [UserName]
```
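With this block in your local `~/.ssh/config`, both plain SSH and the VS Code Remote-SSH extension can connect via the alias (a small sketch; as the alias name suggests, the NYU VPN is assumed):

```bash
# run on your local machine; opens a login shell on Greene using the alias above
ssh NYU-greene-vpn
```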
Change the default .vscode-server location
```bash
# create a new folder to actually hold .vscode-server
mkdir $VAST/.vscode-server
# if ~/.vscode-server already exists, move its contents into the new folder and delete it first
# create a soft link pointing from the default location to our new location
ln -s $VAST/.vscode-server/ ~/.vscode-server
```
Jupyter Lab / Notebook
```bash
jupyter lab --no-browser --ip=0.0.0.0
```
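Jupyter runs on a Greene node, so its port has to be forwarded to your laptop before you can open it in a local browser. A sketch, assuming the default port 8888 and using `[NodeName]` as a placeholder for the node shown in Jupyter's startup log:

```bash
# run on your local machine; reuses the Host alias from the SSH configuration above
ssh -L 8888:[NodeName]:8888 NYU-greene-vpn
# then open http://localhost:8888 and paste the token / URL printed by Jupyter
```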
Others
Multiple steps with srun
- Launch a Jupyter service
```bash
srun --gres=gpu:2 jupyter lab --no-browser --ip=0.0.0.0
```
- Forward the Jupyter port to your local machine, e.g. with an SSH tunnel like the sketch in the Jupyter Lab section above
- Mount datasets using Singularity
```bash
singularity shell \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
- Run Python script
```bash
python train.py --devices 0 1
```
Single step with sbatch
```bash
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --mem=12GB
#SBATCH --time=12:00:00
#SBATCH --job-name=train
#SBATCH --mail-type=END
#SBATCH --mail-user=[NetID]@nyu.edu

module purge

singularity exec \
    --nv \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/[choose/your/image].sif \
    /bin/bash run.sh
```
```bash
#!/bin/bash
source [/path/to/conda]/bin/activate [EnvName]
python train.py --devices 0 1
```
Python Environment Management with Singularity
| Images Folder | Example | Explanation |
| --- | --- | --- |
| /scratch/work/public/overlay-fs-ext3<br>/vast/work/public/overlay-fs-ext3 | overlay-50G-10M.ext3.gz | Writable overlay with 50 GB of free space that can hold up to 10 M files |
| /scratch/work/public/singularity/<br>/vast/work/public/singularity/ | cuda12.1.1-cudnn8.9.0-devel-ubuntu22.04.2.sif | Read-only base images (OS + CUDA/cuDNN) |
Create your own overlay image
- Choose a desired overlay image, copy it to your own folder, and decompress it, for example:
```bash
mkdir $SCRATCH/singularity-images/
cd $SCRATCH/singularity-images/
cp /scratch/work/public/overlay-fs-ext3/overlay-50G-10M.ext3.gz .
gunzip overlay-50G-10M.ext3.gz
```
- Launch a Singularity container from the existing image
```bash
singularity shell \
    --overlay $SCRATCH/singularity-images/overlay-50G-10M.ext3:rw \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
After running this command, there should be an `/ext3/` folder in the root directory, which is only visible from inside the Singularity container. Put everything you want to write into the image under `/ext3/`, including the Miniconda environment. You can run `nvcc -V` to check that the image is loaded correctly; you should see output like the following, which only appears inside the Singularity container.

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
```
- Install Miniconda and Python
```bash
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /ext3/miniconda3
```
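With Miniconda installed under `/ext3`, create your Python environment inside the writable overlay. A sketch; the environment name, Python version, and packages are placeholders, so pick versions that match the CUDA toolkit of your base image:

```bash
# inside the container, with the overlay still mounted :rw
source /ext3/miniconda3/etc/profile.d/conda.sh
conda create -n your_env python=3.8 -y
conda activate your_env
pip install torch torchvision    # choose builds matching the image's CUDA toolkit
```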
- Create a bash script for activating Conda (save it as e.g. `/ext3/env.sh`, which the wrapper scripts below source)
```bash
#!/bin/bash
source /ext3/miniconda3/etc/profile.d/conda.sh
export PATH=/ext3/miniconda3/bin:$PATH
export PYTHONPATH=/ext3/miniconda3/bin:$PATH
```
- Exit the Singularity container and rename the image to whatever you want, for example:

```bash
exit    # or press Ctrl+D
mv overlay-50G-10M.ext3 cp38-torch1.13-cu117.ext3
```
Run Python Script with self-created Singularity image
Example (project-specific) training commands to run once the environment is active:

```bash
PORT=29500 bash tools/dist_train.sh configs/openmixup/pretrain/a2mim/imagenet/r50_l3_sz224_init_8xb256_cos_ep300.py 2 > debug.log 2>&1
python main.py --fname configs/in1k_vith14_ep300_debug.yaml --devices cuda:0 cuda:1 | tee debug.log
```
To work inside your image interactively, open a shell in the container with the overlay mounted read-only:

```bash
singularity exec \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif \
    /bin/bash
```

To open the base image without any overlay:

```bash
singularity shell /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif
```
To request resources and launch the container in a single srun command, mounting both your environment overlay and the dataset overlays:

```bash
srun \
    --cpus-per-task=2 \
    --gres=gpu:2 \
    --mem=12GB \
    --time=18:00:00 \
    --job-name=torch-test \
    singularity exec \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-train.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-val.sqf:ro \
    --overlay /vast/work/public/ml-datasets/imagenet/imagenet-test.sqf:ro \
    /vast/work/public/singularity/cuda12.2.2-cudnn8.9.4-devel-ubuntu22.04.3.sif \
    /bin/bash
```
A wrapper script that activates the `/ext3` environment inside the container:

```bash
#!/bin/bash

singularity exec \
    --nv \
    --overlay $SCRATCH/singularity-images/cp38-torch1.13-cu117.ext3:ro \
    /vast/work/public/singularity/cuda11.7.99-cudnn8.5-devel-ubuntu22.04.2.sif \
    /bin/bash -c "source /ext3/env.sh"
    # append your actual command after env.sh, e.g.:
    # /bin/bash -c "source /ext3/env.sh; python train.py"
```
References
- Getting Started
- Singularity
- Singularity & Miniconda Python environment
- Singularity & Datasets
- Princeton tutorial
- Conda
- Datasets
- SLURM