Tensorflow: How do you monitor GPU performance during model training in real-time?

Question

I am new to Ubuntu and GPUs and have recently been using a new PC with Ubuntu 16.04 and 4 NVIDIA 1080ti GPUs in our lab. The machine also has an i7 16 core processor.

I have some basic questions:

1. TensorFlow is installed for GPU. I presume, then, that it automatically prioritises GPU usage? If so, does it use all 4 together, or does it use 1 and then recruit another if needed?
2. Can I monitor GPU use/activity in real time during training of a model?

I fully understand this is basic hardware stuff but clear definitive answers to these specific questions would be great.

EDIT:

Based on this output, is this really saying that nearly all the memory on each one of my GPUs is being used?

Also, in my experience TensorFlow by default grabs all memory on all GPUs. To avoid this, I set the option `gpu_options.allow_growth` of the session configuration to `True`. See also https://stackoverflow.com/questions/34199233 . – Andre Holzner, Jan 07 '18 at 18:53

score 19 · Accepted Answer · edited Jun 11 '20 at 00:49

TensorFlow does not automatically utilize all GPUs; by default it uses only one GPU, specifically the first one, /gpu:0.

You have to write multi-GPU code to utilize all available GPUs; see the CIFAR multi-GPU example.

To check usage every 0.1 seconds:

watch -n0.1 nvidia-smi

edited Jun 11 '20 at 00:49

n.gaurav

answered Aug 07 '17 at 10:28

Ishant Mrinal

`nvidia-smi` already has the option to refresh using the `-l` or `-lms` flag, so `watch -n0.1 nvidia-smi` is equivalent to `nvidia-smi -lms 100`. Just for the readers, not saying it's better. – AnotherOne Jun 04 '22 at 15:14

score 5 · Answer 2 · answered May 28 '20 at 19:44

Try this command:

nvidia-smi --query-gpu=utilization.gpu --format=csv --loop=1
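
To log these readings from a script instead of watching them in a terminal, the same kind of query can be run from Python. The following is only a minimal sketch and not part of the original answer: the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration, it assumes `nvidia-smi` is on the PATH, and it adds the query fields `index`, `memory.used` and `memory.total` plus the `noheader,nounits` format options so that each output line is plain comma-separated values.

```python
import subprocess

# Fields requested from nvidia-smi; one CSV column per field.
QUERY_FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU.

    Values are kept as strings because nvidia-smi may report
    non-numeric placeholders for unsupported fields.
    """
    rows = []
    for line in text.strip().splitlines():
        values = [value.strip() for value in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (assumes it is installed)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` in a loop with `time.sleep(1)` between calls would give roughly the same one-second cadence as `--loop=1`, at the cost of starting a new process for every sample.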

answered May 28 '20 at 19:44

singrium

score 4 · Answer 3 · answered Aug 07 '17 at 10:28

If no other indication is given, a GPU-enabled TensorFlow installation will default to using the first available GPU (as long as you have the Nvidia driver and CUDA 8.0 installed and the GPU has the necessary compute capability, which, according to the docs, is 3.0). If you want to use more GPUs, you need to use tf.device directives in your graph (more about it here).
The easiest way to check GPU usage is the console tool nvidia-smi. However, unlike top or other similar programs, it only shows the current usage and then exits. As suggested in the comments, you can use something like watch -n1 nvidia-smi to re-run the program continuously (in this case, every second).

score 4 · Answer 4 · answered Dec 17 '20 at 22:38

I would suggest nvtop: it shows real-time status and is easier to watch than nvidia-smi. It also plots the usage as a graph.

$ sudo apt install nvtop
$ nvtop

answered Dec 17 '20 at 22:38

Zstack

I like this one, very useful. – Innat Aug 05 '21 at 21:20

score 1 · Answer 5 · answered Feb 28 '19 at 01:16

All the above commands use watch; it's much more efficient to keep the context alive by using the built-in looper: nvidia-smi -l 1.
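
The built-in loop can also be consumed from a script by keeping one `nvidia-smi` process alive and reading its output line by line. This is a rough sketch under a few assumptions: `parse_sample` and `stream_samples` are hypothetical helper names, `nvidia-smi` is on the PATH, and the query asks for the `timestamp`, `index` and `utilization.gpu` fields in `csv,noheader,nounits` format so that every line is one sample for one GPU.

```python
import subprocess


def parse_sample(line):
    """Split one CSV line 'timestamp, index, utilization' into a tuple.

    The utilization is left as a string because it may not be numeric
    on GPUs that do not support the field.
    """
    timestamp, index, utilization = [part.strip() for part in line.split(",")]
    return timestamp, int(index), utilization


def stream_samples(interval_s=1):
    """Yield samples from a single long-lived `nvidia-smi -l` process."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=timestamp,index,utilization.gpu",
        "--format=csv,noheader,nounits",
        "-l", str(interval_s),  # built-in loop, interval in whole seconds
    ]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        try:
            for line in proc.stdout:
                if line.strip():
                    yield parse_sample(line)
        finally:
            proc.terminate()  # stop nvidia-smi when the consumer stops iterating
```

A caller would iterate with `for timestamp, gpu, util in stream_samples(1): ...` and break out of the loop when training finishes.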

If you want to see something like htop and nvidia-smi at the same time, you can try glances (pip install glances).

answered Feb 28 '19 at 01:16

TimZaman

score 0 · Answer 6 · answered Mar 17 '19 at 00:25

If you are using GCP, take a look at this script, which lets you monitor GPU utilization in Stackdriver. You can also use it to collect nvidia-smi data with the nvidia-smi -l 5 command and report those statistics for you to track.

https://github.com/GoogleCloudPlatform/ml-on-gcp/tree/master/dlvm/gcp-gpu-utilization-metrics

edited Mar 28 '19 at 05:32

answered Mar 17 '19 at 00:25

gogasca

score 0 · Answer 7 · answered May 21 '20 at 20:31

You should use nvidia-smi. Just keep in mind that depending on your workload you might not see any change in the load if the task completes between 2 sampling events.

Also keep in mind that the shortest sample period is 1/6 second (and it can be as long as 1 second, depending on the product), as per: http://manpages.org/nvidia-smi

Utilization rates report how busy each GPU is over time, and can be used to determine how much an application is using the GPUs in the system. Note: During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings. This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.

GPU: Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product.

Memory: Percent of time over the past sample period during which global (device) memory was being read or written. The sample period may be between 1 second and 1/6 second depending on the product.
