12

I am new to Ubuntu and GPUs and have recently started using a new PC in our lab with Ubuntu 16.04 and 4 NVIDIA 1080 Ti GPUs. The machine also has a 16-core i7 processor.

I have some basic questions:

  1. TensorFlow is installed with GPU support. I presume, then, that it automatically prioritises GPU usage? If so, does it use all 4 together, or does it use 1 and then recruit another if needed?

  2. Can I monitor GPU use/activity in real time while a model is training?

I fully understand this is basic hardware stuff but clear definitive answers to these specific questions would be great.

EDIT:

Based on this output, is this really saying that nearly all the memory on each one of my GPUs is being used?

enter image description here

GhostRider

7 Answers

19
  1. TensorFlow does not automatically use all GPUs. By default it runs on only one GPU, specifically the first one, /gpu:0.

    You have to write multi-GPU code to use all the GPUs available. See the CIFAR multi-GPU example.

  2. To check usage every 0.1 seconds:

    watch -n0.1 nvidia-smi

n.gaurav
Ishant Mrinal
  • `nvidia-smi` already has an option to refresh itself via the `-l` or `-lms` flag, so `watch -n0.1 nvidia-smi` is equivalent to `nvidia-smi -lms 100`. Just for the readers, not saying it's better. – AnotherOne Jun 04 '22 at 15:14
5

Try this command:

nvidia-smi --query-gpu=utilization.gpu --format=csv --loop=1
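The same query interface can also be read from a script. Below is a minimal Python sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `read_gpus` are hypothetical, and the extra query fields (`index`, `memory.used`, `memory.total`) and the `noheader,nounits` format options are added here for illustration.

```python
import subprocess

# Query per-GPU index, utilization and memory as bare comma-separated numbers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse the CSV text produced by QUERY into a list of dicts.

    Assumes every field is numeric; some GPUs report non-numeric
    placeholders for unsupported fields, which would raise ValueError here.
    """
    rows = []
    for line in text.strip().splitlines():
        index, util, used, total = [field.strip() for field in line.split(",")]
        rows.append({
            "index": int(index),
            "util_percent": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return rows


def read_gpus():
    """Run nvidia-smi once and return the parsed per-GPU readings."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_gpu_csv(out)
```

Calling `read_gpus()` in a loop with `time.sleep(1)` gives roughly the same view as `--loop=1`. Keeping utilization and memory side by side is useful because the two can diverge: by default TensorFlow reserves most of the memory on every GPU it can see, so a GPU can show nearly full memory while its utilization stays at 0%.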


singrium
4
  1. If no other indication is given, a GPU-enabled TensorFlow installation will default to using the first available GPU (as long as you have the Nvidia driver and CUDA 8.0 installed and the GPU has the necessary compute capability, which, according to the docs, is 3.0). If you want to use more GPUs, you need to use tf.device directives in your graph (more about it here).
  2. The easiest way to check GPU usage is the console tool nvidia-smi. However, unlike top or other similar programs, it only shows the current usage and then exits. As suggested in the comments, you can use something like watch -n1 nvidia-smi to re-run the program continuously (in this case every second).
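To illustrate the tf.device directives mentioned in point 1, here is a rough Python sketch that splits a batch into shards and pins each shard's computation to a different GPU. The helper names (`gpu_device_names`, `split_batch`, `sum_of_squares_on_gpus`) are hypothetical, the sum of squares is only a stand-in for a real model, and the last function assumes TensorFlow is installed and that the named GPUs actually exist.

```python
def gpu_device_names(num_gpus):
    """Return TensorFlow device strings for the first num_gpus GPUs."""
    return ["/gpu:%d" % i for i in range(num_gpus)]


def split_batch(items, num_shards):
    """Split a list into num_shards contiguous, nearly equal shards."""
    base, extra = divmod(len(items), num_shards)
    shards, start = [], 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(items[start:start + size])
        start += size
    return shards


def sum_of_squares_on_gpus(batch, num_gpus):
    """Compute sum(x**2) with each shard placed on its own GPU.

    Toy workload only. In graph mode (TensorFlow 1.x) the returned tensor
    still has to be evaluated in a session.
    """
    import tensorflow as tf  # imported lazily so the helpers above work without it

    partial_sums = []
    shards = split_batch(batch, num_gpus)
    for device, shard in zip(gpu_device_names(num_gpus), shards):
        with tf.device(device):  # ops created here are assigned to this GPU
            x = tf.constant(shard, dtype=tf.float32)
            partial_sums.append(tf.reduce_sum(x * x))
    return tf.add_n(partial_sums)  # combine the per-GPU results
```

On the machine from the question, `sum_of_squares_on_gpus(list(range(1000)), 4)` would place one quarter of the work on each of the four cards. Real training code also has to combine gradients across devices, which this sketch does not attempt.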
jdehesa
4

I would suggest nvtop. It shows real-time status, is easier to watch than nvidia-smi, and also plots the usage as a graph.

$ sudo apt install nvtop
$ nvtop


Zstack
1

The watch-based commands above re-launch nvidia-smi on every refresh. It's much more efficient to keep the context alive by using the built-in looper: nvidia-smi -l 1.
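If you want to consume that looping output from a script, one option is to start a single long-lived nvidia-smi process and read its output as it arrives. The Python sketch below assumes `nvidia-smi` is on the PATH; `loop_command` and `stream` are hypothetical helper names. It uses `-l` for whole-second intervals and `-lms` (milliseconds) otherwise.

```python
import subprocess


def loop_command(interval_seconds):
    """Build an nvidia-smi command line that refreshes itself.

    Whole-second intervals use -l; anything else is converted to
    milliseconds and passed to -lms.
    """
    if interval_seconds >= 1 and float(interval_seconds).is_integer():
        return ["nvidia-smi", "-l", str(int(interval_seconds))]
    return ["nvidia-smi", "-lms", str(int(round(interval_seconds * 1000)))]


def stream(interval_seconds):
    """Start one nvidia-smi process and echo its output until Ctrl-C."""
    proc = subprocess.Popen(
        loop_command(interval_seconds),
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    try:
        for line in proc.stdout:
            print(line, end="")
    except KeyboardInterrupt:
        pass
    finally:
        proc.terminate()  # stop the looping nvidia-smi process
        proc.wait()
```

For example, `stream(1)` runs the equivalent of `nvidia-smi -l 1`, and `stream(0.1)` runs `nvidia-smi -lms 100`.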

If you want to see something like htop and nvidia-smi at the same time, you can try glances (pip install glances).

TimZaman
0

If you are using GCP, take a look at this script, which lets you monitor GPU utilization in StackDriver. You can also use it to collect nvidia-smi data with the nvidia-smi -l 5 command and report those statistics for you to track.

https://github.com/GoogleCloudPlatform/ml-on-gcp/tree/master/dlvm/gcp-gpu-utilization-metrics


gogasca
0

You should use nvidia-smi. Just keep in mind that depending on your workload you might not see any change in the load if the task completes between 2 sampling events.

Also keep in mind that the shortest sampling period is 1/6 second (the period ranges from 1 second down to 1/6 second depending on the product), as per: http://manpages.org/nvidia-smi

Utilization rates report how busy each GPU is over time, and can be used to determine how much an application is using the GPUs in the system. Note: During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings. This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.

GPU: Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

Memory: Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.
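To make the sampling caveat concrete, here is a toy Python model, not a measurement of real hardware. It assumes, following the quoted definition, that each reading is the fraction of the preceding sample period during which the GPU was busy. The function names and the single busy interval are hypothetical.

```python
def busy_fraction(window_start, window_end, busy_start, busy_end):
    """Fraction of the window [window_start, window_end] spent busy."""
    overlap = min(window_end, busy_end) - max(window_start, busy_start)
    return max(0.0, overlap) / (window_end - window_start)


def polled_utilization(poll_times, sample_period, busy_start, busy_end):
    """Utilization (in percent) that each poll would report.

    Each poll at time t only sees the window [t - sample_period, t].
    """
    return [
        round(100 * busy_fraction(t - sample_period, t, busy_start, busy_end))
        for t in poll_times
    ]


# A task that keeps the GPU busy from t=6.0s to t=7.5s, with a 1s sample period.
coarse = polled_utilization([5, 10, 15], 1.0, 6.0, 7.5)  # polling every 5 seconds
fine = polled_utilization([6, 7, 8], 1.0, 6.0, 7.5)      # polling every second
print(coarse)  # [0, 0, 0]
print(fine)    # [0, 100, 50]
```

In this model the 5-second polling never overlaps the 1.5-second task and reports 0% throughout, while 1-second polling catches it.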

QuantumLicht