TensorFlow: big difference in GPU utilization when setting CUDA_VISIBLE_DEVICES to different values

Question

Linux: Ubuntu 16.04.3 LTS (GNU/Linux 4.10.0-38-generic x86_64)

TensorFlow: 1.4, compiled from source

GPU: 4xP100

I am trying the newly released object detection tutorial training program. I noticed a big difference in GPU utilization depending on the value of CUDA_VISIBLE_DEVICES. Specifically, when it is set to "gpu:0", GPU utilization is quite high, around 80%-90%, but when I set it to other GPU devices, such as gpu:1 or gpu:2, GPU utilization is very low, between 10% and 30%.

The training speed seems to be roughly the same in both cases, and much faster than when using the CPU only.

I am just curious why this happens.

score 0 · Answer 1 · answered Jun 05 '18 at 20:32

As this answer mentions, GPU-Util is a measure of the usage/busyness of the computation on each GPU.

I'm not an expert, but in my experience GPU 0 is generally where most of your processes run by default. CUDA_VISIBLE_DEVICES sets the GPUs seen by the processes you launch from that shell. Therefore, by setting CUDA_VISIBLE_DEVICES to gpu:1 or gpu:2 you are making the training run on less busy GPUs.
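As a rough illustration of how the variable is usually set, here is a minimal Python sketch. The helper name `env_with_visible_gpus` is made up for this example. It assumes the variable takes a comma-separated list of device indices such as "1" or "0,2" rather than names like "gpu:1", and that it is set before TensorFlow is imported, since CUDA reads it when it initializes.

```python
import os


def env_with_visible_gpus(indices, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices.

    `indices` are physical GPU indices as listed by nvidia-smi (hypothetical
    helper, not part of TensorFlow or CUDA).
    """
    env = dict(os.environ if base_env is None else base_env)
    # CUDA expects a comma-separated list of indices, e.g. "1" or "0,2".
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in indices)
    return env


if __name__ == "__main__":
    # Set the variable before importing TensorFlow so that CUDA picks it up.
    os.environ["CUDA_VISIBLE_DEVICES"] = env_with_visible_gpus([1])["CUDA_VISIBLE_DEVICES"]
    print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that inside the process the visible devices are renumbered starting from 0, so a process restricted to physical GPU 1 addresses it as its first GPU, while nvidia-smi keeps reporting the physical index.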

Moreover, you only reported one value, whereas in theory you should have one per GPU. It is possible that you were only looking at GPU-Util for GPU 0, which would of course decrease if you are not using that GPU.
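One way to check this is to read the utilization of every GPU at once. The sketch below assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output; the function names are made up for this example, and the parser assumes every GPU reports a numeric utilization.

```python
import subprocess


def parse_gpu_util(csv_text):
    """Parse lines like '0, 85' into a dict {gpu_index: utilization_percent}."""
    util = {}
    for line in csv_text.strip().splitlines():
        index, percent = (field.strip() for field in line.split(","))
        # Assumes a numeric value; int() raises ValueError on anything else.
        util[int(index)] = int(percent)
    return util


def query_gpu_util():
    """Ask nvidia-smi for the current utilization of every physical GPU."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_gpu_util(output)


if __name__ == "__main__":
    for gpu_index, percent in sorted(query_gpu_util().items()):
        print("GPU {}: {}%".format(gpu_index, percent))
```

Running this while training with different CUDA_VISIBLE_DEVICES values should show which physical GPU is actually busy, instead of only the figure for GPU 0.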
