I am running FastRCNN with a ResNet50 architecture. I load the model checkpoint and run inference like this:
import tensorflow as tf

# x and y_pred are defined when the graph is built (omitted here)
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore the checkpoint into the session before running inference
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})
Everything seems to be working great. The model takes 0.08s to actually perform inference.
But I noticed that when I do this, GPU memory usage explodes to 15637MiB / 16280MiB according to nvidia-smi.
I found that you can set config.gpu_options.allow_growth to stop TensorFlow from allocating the entire GPU up front and instead have it grab GPU memory as needed:
config = tf.ConfigProto()
# allocate GPU memory on demand instead of reserving the whole card
config.gpu_options.allow_growth = True
saver = tf.train.Saver()
with tf.Session(config=config) as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})
Doing this decreases memory usage to 4875MiB / 16280MiB, and the model still takes 0.08s to run.
Finally, I capped the allocation at a fixed fraction of GPU memory using per_process_gpu_memory_fraction:
config = tf.ConfigProto()
# cap this process at 5% of the GPU's memory
config.gpu_options.per_process_gpu_memory_fraction = 0.05
saver = tf.train.Saver()
with tf.Session(config=config) as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})
Doing this brings usage down to 1331MiB / 16280MiB, and the model still takes 0.08s to run.
This raises the question: how does TF allocate memory for a model at inference time? If I want to load this model 10 times on the same GPU to run inference in parallel, will that be an issue?
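For reference, the kind of setup I have in mind is roughly the sketch below: one process per model copy, each capped at about a tenth of the GPU. The checkpoint path, tensor names, and input shape are placeholders for my actual model.

import multiprocessing as mp
import numpy as np

def run_worker(worker_id):
    import tensorflow as tf  # import inside the worker so each process gets its own TF runtime

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.1  # ~1/10th of the GPU per copy

    with tf.Session(config=config) as sess:
        # rebuild the graph from the .meta file and restore the weights
        saver = tf.train.import_meta_graph('model/model.ckpt.meta')
        saver.restore(sess, 'model/model.ckpt')

        graph = tf.get_default_graph()
        x = graph.get_tensor_by_name('x:0')            # placeholder name is a guess
        y_pred = graph.get_tensor_by_name('y_pred:0')  # output name is a guess

        dummy = np.zeros((1, 224, 224, 3), dtype=np.float32)  # dummy input, shape is a placeholder
        print(worker_id, sess.run(y_pred, feed_dict={x: dummy}).shape)

if __name__ == '__main__':
    workers = [mp.Process(target=run_worker, args=(i,)) for i in range(10)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()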