6

I'm trying to use the TensorFlow profiler to measure detailed GPU memory usage, such as conv1 activations, weights, etc. TF profile reported a 4000 MB peak usage, but at the same time I measured with nvidia-smi, which reported 10000 MB usage. There is a big difference between the two and I don't know the root cause. Can anyone give some suggestions on how to proceed?

TF profile:

(screenshot of the TF profiler report showing ~4000 MB peak usage)

nvidia-smi:

(screenshot of nvidia-smi showing ~10000 MB in use)

TensorFlow version: 1.9.0

user5473110

1 Answer

5

First, TF by default allocates most, if not all, of the available GPU memory when it starts; this actually lets TF use memory more effectively. To change this behavior, you can set an environment flag, export TF_FORCE_GPU_ALLOW_GROWTH=true. More options are described in the TensorFlow GPU documentation.
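
For example, a minimal sketch of both routes (not code from the original answer; the toy session body is a placeholder, and on very old TF versions only the ConfigProto route may be honored):

```python
import os
import tensorflow as tf

# Route 1: the environment flag mentioned above; it must be set before
# TF touches the GPU for the first time.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# Route 2 (TF 1.x): per-session config asking TF to grow its GPU
# allocation on demand instead of reserving (almost) everything up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run the graph as usual
```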

Once you've done that, nvidia-smi will still report exaggerated memory usage numbers, because nvidia-smi reports the memory TF has allocated (reserved), while the profiler reports the peak memory actually in use.

TF uses BFC (best-fit with coalescing) as its memory allocator. Whenever TF runs out of its current allocation of, say, 4GB, it allocates twice that amount, 8GB; the next time, it would try to allocate 16GB. At the same time, the program might only use 9GB of memory at peak, but the 16GB allocation is what nvidia-smi reports. Also, BFC is not the only thing that allocates GPU memory in TensorFlow, so it can actually use 9GB plus something.
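
To make the doubling concrete, here is a toy sketch (illustrative arithmetic only, not actual allocator code; the 4 GB starting region and 9 GB peak are the figures assumed in the paragraph above):

```python
# Toy illustration of the region-doubling described above; the real BFC
# allocator is more involved, so treat these numbers as an assumption.
reserved_gb = 4      # what the allocator has reserved so far
peak_in_use_gb = 9   # what the model actually needs at peak

while reserved_gb < peak_in_use_gb:
    reserved_gb *= 2  # 4 -> 8 -> 16

print("profiler (in use):     ~%d GB" % peak_in_use_gb)  # ~9 GB
print("nvidia-smi (reserved): ~%d GB" % reserved_gb)     # ~16 GB
```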

Another comment here would be that TensorFlow's native tools for reporting memory usage were not particularly precise in the past, so I would allow myself to say that the profiler might actually be somewhat underestimating peak memory usage.

Here is some info on memory management: https://github.com/miglopst/cs263_spring2018/wiki/Memory-management-for-tensorflow

Another, slightly more advanced, link for checking memory usage: https://github.com/yaroslavvb/memory_util
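
Since the question is on TF 1.9, one more way to cross-check the profiler's number (also brought up in the comments below) is the TF 1.x contrib memory-stats op. A minimal sketch, with a toy graph standing in for the real model:

```python
import tensorflow as tf

# Hypothetical toy graph; replace with the actual model.
x = tf.random_normal([1024, 1024])
y = tf.matmul(x, x)

with tf.Session() as sess:
    sess.run(y)
    # Peak GPU memory actually in use by TF so far, in bytes.
    # tf.contrib.memory_stats exists only in TF 1.x.
    peak_bytes = sess.run(tf.contrib.memory_stats.MaxBytesInUse())
    print("Peak memory in use: %.1f MB" % (peak_bytes / 1024.0 / 1024.0))
```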

y.selivonchyk
  • Thanks for the info. I've already used the "allow_growth" parameter, but it seems there will be an OOM error when nvidia-smi's reported memory usage exceeds its maximum while the TF profiler's report is still low. Does this mean that OOM happens when the allocated memory exceeds the maximum, rather than the used memory? – user5473110 Jan 03 '20 at 22:16
  • 1
    In the memory_util link you sent, I saw this sentence: "Deprecated in favor of mem_util https://github.com/yaroslavvb/chain_constant_memory/blob/master/mem_util_test.py". Should I use that instead? – user5473110 Jan 03 '20 at 22:20
  • @user5473110 yes, probably. Also, since you are using TF1.9 I would not rely on profiler data. Maybe try `sess.run(tf.contrib.memory_stats.MaxBytesInUse())` (taken from here https://stackoverflow.com/questions/40190510/tensorflow-how-to-log-gpu-memory-vram-utilization) - I am curious if it is going to show anything different. – y.selivonchyk Jan 03 '20 at 23:32
  • Regarding OOM and BFC: it will allocate nearly the maximum device memory, 10978/11443 MB in your case. Then, let's say at some point memory usage is 9777 MB and TF tries to place a tensor of size 3000 MB; BFC would then try to allocate some unreasonable amount of memory like 8GB, fail, and report something like "Failed to allocate 8GB of memory while allocating tensor X". Also, try to read through the OOM output to see which tensors take up most of the space on your GPU. – y.selivonchyk Jan 03 '20 at 23:38
  • Thanks. I'll try sess.run(tf.contrib.memory_stats.MaxBytesInUse()). So which version of TF do you suggest if I want to rely on profiler data? And is there a way to estimate memory usage if I know the CNN architecture, image size, batch size, etc.? – user5473110 Jan 06 '20 at 21:45