When I submit a SLURM job with the option --gres=gpu:1 to a node with two GPUs, how can I get the ID of the GPU that is allocated to the job? Is there an environment variable for this purpose? The GPUs I'm using are all NVIDIA GPUs. Thanks.

talonmies
Negelis

3 Answers


You can get the GPU ID from the environment variable CUDA_VISIBLE_DEVICES. This variable is a comma-separated list of the GPU IDs assigned to the job.
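
A minimal sketch of how this could be read inside a batch script submitted with --gres=gpu:1 (the echo text is just illustrative):

    #!/bin/bash
    #SBATCH --gres=gpu:1

    # CUDA_VISIBLE_DEVICES is set by Slurm to the id(s) of the allocated GPU(s),
    # e.g. "0" or "1" on a two-GPU node; CUDA applications started from this
    # job will only use the device(s) listed here
    echo "Allocated GPU(s): $CUDA_VISIBLE_DEVICES"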

Carles Fenoy
  • It works. Thanks. It seems that the environment variable GPU_DEVICE_ORDINAL also works. – Negelis May 14 '17 at 20:40
  • This doesn't identify the GPU uniquely when using cgroups. With cgroups, CUDA_VISIBLE_DEVICES would be 0 for all GPUs because each process only sees a single GPU (others are hidden by the cgroup). – isarandi Jun 12 '19 at 15:44

You can check the environment variables SLURM_STEP_GPUS or SLURM_JOB_GPUS for a given node:

echo ${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}

Note that CUDA_VISIBLE_DEVICES may not correspond to the physical GPU ID (see @isarandi's comment above).

Also note that this should work for non-NVIDIA GPUs as well.
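
As a sketch, the same fallback dropped into a batch script (the gpu_id variable name is just illustrative):

    #!/bin/bash
    #SBATCH --gres=gpu:1

    # SLURM_STEP_GPUS is set within job steps (srun), SLURM_JOB_GPUS at the
    # job level; use whichever is available
    gpu_id=${SLURM_STEP_GPUS:-$SLURM_JOB_GPUS}
    echo "GPU id(s) allocated on this node: $gpu_id"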

bryant1410

Slurm stores this information in an environment variable, SLURM_JOB_GPUS.

One way to keep track of such information is to log all SLURM-related variables when running a job, for example by including the following command in the script run by sbatch (following Kaldi's slurm.pl, which is a handy script for wrapping Slurm jobs):

set | grep SLURM | while read line; do echo "# $line"; done
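
For example, a sketch of a batch script that records these variables at the top of its log (the --output pattern is only illustrative):

    #!/bin/bash
    #SBATCH --gres=gpu:1
    #SBATCH --output=job_%j.log

    # Dump every SLURM_* variable (including SLURM_JOB_GPUS) as commented lines,
    # so the allocated GPU ids are recorded alongside the job's output
    set | grep SLURM | while read line; do echo "# $line"; done

    # ... rest of the job ...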
leilu