
I have:

#!/bin/bash

echo -- Start my submission file

export SLURM_JOBID=$(((RANDOM)))
echo SLURM_JOBID = $SLURM_JOBID

#export CUDA_VISIBLE_DEVICES=$(((RANDOM%8)))
#export CUDA_VISIBLE_DEVICES=0
#export CUDA_VISIBLE_DEVICES=1
#export CUDA_VISIBLE_DEVICES=2
#export CUDA_VISIBLE_DEVICES=3
#export CUDA_VISIBLE_DEVICES=4
#export CUDA_VISIBLE_DEVICES=5
#export CUDA_VISIBLE_DEVICES=6
#export CUDA_VISIBLE_DEVICES=7
#export CUDA_VISIBLE_DEVICES=4,5,6,7
#export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
#export CUDA_VISIBLE_DEVICES=0,1,3,4,5,6,7

echo CUDA_VISIBLE_DEVICES
echo $CUDA_VISIBLE_DEVICES
echo torch.cuda.device_count is:
python -c "import torch; print(torch.cuda.device_count())"
echo ---- Running your python main ----

pip install wandb --upgrade

#export SLURM_JOBID=-1
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main2_metalearning.py --manual_loads_name manual_load_cifarfs_resnet12rfs_maml > $OUT_FILE &

# - SL
#export OUT_FILE=$PWD/main.sh.o$SLURM_JOBID
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_sl_with_ddp.py --manual_loads_name sl_mi_rfs_5cnn_adam_cl_200 > $OUT_FILE &
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_sl_with_ddp.py --manual_loads_name sl_mi_rfs_resnet_rfs_mi_adam_cl_200 > $OUT_FILE &

#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_sl_with_ddp.py --manual_loads_name sl_cifarfs_rfs_resnet12rfs_adam_cl_200 > $OUT_FILE &
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_sl_with_ddp.py --manual_loads_name sl_cifarfs_rfs_resnet12rfs_adam_cl_600 > $OUT_FILE &
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_sl_with_ddp.py --manual_loads_name sl_cifarfs_rfs_4cnn_adam_cl_200 > $OUT_FILE &
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_sl_with_ddp.py --manual_loads_name sl_cifarfs_rfs_4cnn_adam_cl_600 > $OUT_FILE &
#echo pid = $!
#echo CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES
#echo SLURM_JOBID = $SLURM_JOBID

# - MAML
export OUT_FILE=$PWD/main.sh.o$SLURM_JOBID
#python -m torch.distributed.run --nproc_per_node=4 ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py --manual_loads_name l2l_resnet12rfs_cifarfs_rfs_adam_cl_100k > $OUT_FILE &
#python -m torch.distributed.run --nproc_per_node=4 ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py --manual_loads_name l2l_4CNNl2l_cifarfs_rfs_adam_cl_70k > $OUT_FILE &

python -m torch.distributed.run --nproc_per_node=8 ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py --manual_loads_name l2l_resnet12rfs_mi_rfs_adam_cl_100k > $OUT_FILE &
echo pid = $!
echo CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES
echo SLURM_JOBID = $SLURM_JOBID

# - Data analysis
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main2_distance_sl_vs_maml.py
#python -u ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/_main_distance_sl_vs_maml.py

echo -- Done submitting job in dgx A100-SXM4-40G

So clearly there are 8 GPUs, e.g.:

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

How do I do:

length(CUDA_VISIBLE_DEVICES)

and pass it directly in my bash script? This would be trivial in Python.
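For example, the kind of one-liner I have in mind (illustration only, using a python -c call like the one already in the script above):

python -c "import os; print(len(os.environ['CUDA_VISIBLE_DEVICES'].split(',')))"  # prints 8 for the export above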

related:

Charlie Parker
    `so clearly` could you explain how it is clear? How do you query the number of GPUs? `export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` do you want to filter the `CUDA_VISIBLE_DEVICES` line from a file and count the number of digits after the `=` sign? Is that line in a file related to the number of GPUs? – KamilCuk Feb 15 '22 at 20:49

2 Answers


This is not an optimal solution, but it can be helpful:

We can use a bash array to count the number of words. However, bash splits words on whitespace, so we need to put a space after each comma in CUDA_VISIBLE_DEVICES, as follows:

export CUDA_VISIBLE_DEVICES='2, 1, 3' # Don't forget to give spaces after commas!

Given the above, we can use the bash array technique to count the words, which gives the number of GPUs specified:

export CUDA_VISIBLE_DEVICES='2, 1, 3' # Don't forget to give spaces after commas!
CVD=($CUDA_VISIBLE_DEVICES) # create bash array from specified CUDA_VISIBLE_DEVICES
NUM_GPUS=${#CVD[@]} # count the number of space-delimited words in the CVD bash array
echo $NUM_GPUS # print to confirm that $NUM_GPUS is set correctly
mpiexec -n $NUM_GPUS python train.py ... # use $NUM_GPUS as per your requirement, e.g., mpiexec for distributed GPU training.
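
To see why the spaces matter (a quick illustration, not part of the snippet above): without them, word splitting sees a single word and the count comes out as 1.

export CUDA_VISIBLE_DEVICES='2,1,3'   # no spaces after the commas this time
CVD=($CUDA_VISIBLE_DEVICES)           # default IFS has no comma, so this is one word: "2,1,3"
echo ${#CVD[@]}                       # prints 1, not 3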
omsrisagar

Following this answer and the answer above by @omsrisagar, you could also implement it in the following way:

export CUDA_VISIBLE_DEVICES=4,5
CVD=(${CUDA_VISIBLE_DEVICES//,/ })   # replace the commas with spaces, then split into a bash array
NUM_GPUS=${#CVD[@]}                  # number of array elements = number of GPUs
echo $NUM_GPUS                       # prints 2 here
...
JQK
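
For instance, plugging $NUM_GPUS into the launch line from the question (a sketch that reuses the question's script path and flags as-is):

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
CVD=(${CUDA_VISIBLE_DEVICES//,/ })   # split on commas into a bash array
NUM_GPUS=${#CVD[@]}                  # 8 in this case
python -m torch.distributed.run --nproc_per_node=$NUM_GPUS ~/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py --manual_loads_name l2l_resnet12rfs_mi_rfs_adam_cl_100k > $OUT_FILE &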