I am running FastRCNN with a ResNet50 architecture. I load the model checkpoint and run inference like this:
import tensorflow as tf

# x and y_pred are defined when the graph is built (omitted here)
saver = tf.train.Saver()
with tf.Session() as sess:
    # restore the checkpoint into the session before running inference
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})
Everything seems to be working great. The model takes 0.08s to actually perform inference.
But I noticed that when I do this, GPU memory usage explodes to 15637MiB / 16280MiB according to nvidia-smi.
I found that you can set config.gpu_options.allow_growth to stop TensorFlow from allocating the entire GPU up front and instead have it grab GPU memory as needed:
config = tf.ConfigProto()
# allocate GPU memory on demand instead of reserving the whole card
config.gpu_options.allow_growth = True
saver = tf.train.Saver()
with tf.Session(config=config) as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})
Doing this decreases memory usage to 4875MiB / 16280MiB, and the model still takes 0.08s to run.
Finally, I capped the allocation at a fixed fraction of GPU memory using per_process_gpu_memory_fraction:
config = tf.ConfigProto()
# cap this process at 5% of the GPU's memory
config.gpu_options.per_process_gpu_memory_fraction = 0.05
saver = tf.train.Saver()
with tf.Session(config=config) as sess:
    saver.restore(sess, 'model/model.ckpt')
    sess.run(y_pred, feed_dict={x: input_data})
Doing this brings usage down to 1331MiB / 16280MiB, and the model still takes 0.08s to run.
This raises the question: how does TF allocate memory for a model at inference time? If I want to load this model 10 times on the same GPU to run inference in parallel, will that be an issue?
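For reference, the kind of setup I have in mind is roughly the sketch below: one process per model copy, each capped at about a tenth of the GPU. The checkpoint path, tensor names, and input shape are placeholders for my actual model.

import multiprocessing as mp
import numpy as np

def run_worker(worker_id):
    import tensorflow as tf  # import inside the worker so each process gets its own TF runtime

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.1  # ~1/10th of the GPU per copy

    with tf.Session(config=config) as sess:
        # rebuild the graph from the .meta file and restore the weights
        saver = tf.train.import_meta_graph('model/model.ckpt.meta')
        saver.restore(sess, 'model/model.ckpt')

        graph = tf.get_default_graph()
        x = graph.get_tensor_by_name('x:0')            # placeholder name is a guess
        y_pred = graph.get_tensor_by_name('y_pred:0')  # output name is a guess

        dummy = np.zeros((1, 224, 224, 3), dtype=np.float32)  # dummy input, shape is a placeholder
        print(worker_id, sess.run(y_pred, feed_dict={x: dummy}).shape)

if __name__ == '__main__':
    workers = [mp.Process(target=run_worker, args=(i,)) for i in range(10)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()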