6

I can run a job on Slurm with, for example, srun --gpus=2, and it will set CUDA_VISIBLE_DEVICES to the GPUs allocated. However, I know of no way to inspect which GPUs Slurm allocated to a particular job. If I run scontrol show job, it shows me something like TresPerJob=gpu:2, but that doesn't contain the actual GPUs allocated.
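
For concreteness, here is roughly what I am running (12345 stands in for a real job ID):

```bash
# Launch a job step with two GPUs; Slurm sets CUDA_VISIBLE_DEVICES inside it.
srun --gpus=2 nvidia-smi

# Inspecting the job only shows the GPU count (TresPerJob=gpu:2), not which devices.
scontrol show job 12345
```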

Where can I find this information? In other words, how can I look up which GPUs job n was allocated?

schmmd
  • Does this answer your question? [How to get the ID of GPU allocated to a SLURM job on a multiple GPUs node?](https://stackoverflow.com/questions/43967405/how-to-get-the-id-of-gpu-allocated-to-a-slurm-job-on-a-multiple-gpus-node) – bryant1410 Jan 13 '21 at 20:00

3 Answers

10

scontrol show job -d can do this. The -d flag adds extra information to the output, one piece of which is a field like GRES=gpu(IDX:0-2) listing the allocated GPU indices.
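
For example, assuming a hypothetical job ID of 12345, something like this should surface the line carrying the indices:

```bash
# --details (-d) adds per-node detail; the GRES field there carries the
# allocated GPU indices, e.g. GRES=gpu(IDX:0-2). 12345 is a placeholder job ID.
scontrol show job -d 12345 | grep -i gres
```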

midiarsi
2

When you execute the nvidia-smi command, you get something like this:

[nvidia-smi output: a table listing each GPU's index, name, temperature, power, and memory usage, followed by the running processes]

The "GPU" column is the ID of the GPU which usually matches the device in the system (ls /dev/nvidia*). This same identification is used by Slurm in CUDA_VISIBLE_DEVICES environment variable. So, when in this variable you see

0,1,2

it means that the job has been assigned the GPUs whose IDs are 0, 1, and 2.
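
A quick way to check this from the job itself is to echo the variable inside the step; a minimal sketch, assuming a two-GPU request:

```bash
# Print which GPU IDs Slurm handed to this job step.
srun --gpus=2 bash -c 'echo "$CUDA_VISIBLE_DEVICES"'
```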

Bub Espinja
2

If you're just looking for what Slurm set CUDA_VISIBLE_DEVICES to, I'd suggest reading /proc/12345/environ, where the number is the PID of whatever process Slurm launched.
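
A minimal sketch of that, with 12345 as a placeholder PID (you could find a real one with scontrol listpids <jobid> or plain ps); environ is NUL-separated, so it's easier to read after translating NULs to newlines:

```bash
# /proc/<pid>/environ is NUL-separated; convert to newlines, then pull out the
# variable Slurm set for the step.
tr '\0' '\n' < /proc/12345/environ | grep '^CUDA_VISIBLE_DEVICES='
```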

However, this is liable to be overridden by something like srun --export=ALL bash -i, so you can't rely on it in the adversarial case.

Brendan
  • I wrote a script that does just this: https://gist.github.com/schmmd/1aa445be858ce560d48e13ef2041fea1 – schmmd Nov 20 '19 at 19:08