
I am working on a TensorFlow model that uses a lot of RAM. It is executed iteratively to process a series of tasks.

However, over time the process consumes more and more RAM, although the memory should be released after each task. It looks as if data from one graph were kept alive across iterations, but I am almost sure that the graphs are cleanly separated.

Problem

I reduced the code to the following:

import tensorflow as tf
import numpy as np

reps = 30
for i in range(reps):
    with tf.Graph().as_default() as graph:
        with tf.Session(graph=graph) as sess:
            tf.constant(np.random.random((1000,1000,200,1)))

I have 32 GB of RAM available and am running Ubuntu 17.04 with the CPU build of TensorFlow 1.3. The code above fails with the following error message after about the 25th to 27th iteration:

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc

Giving the process some time after each iteration results in no improvement:

import tensorflow as tf
import numpy as np
import time

reps = 30
for i in range(reps):
    with tf.Graph().as_default() as graph:
        with tf.Session(graph=graph) as sess:
            tf.constant(np.random.random((1000,1000,200,1)))
    time.sleep(1)

However, it works if I explicitly trigger garbage collection after each repetition:

import tensorflow as tf
import numpy as np
import gc

reps = 30
for i in range(reps):
    with tf.Graph().as_default() as graph:
        with tf.Session(graph=graph) as sess:
            tf.constant(np.random.random((1000,1000,200,1)))
    gc.collect()

Question

Now I wonder why I need to force garbage collection to run even though tensorflow should have closed the session and de-referenced the graph object.
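One possible explanation, sketched here without TensorFlow: if the graph object is part of a reference cycle (for example, operations that point back to their graph; this is an assumption about TensorFlow's internals, not something verified above), reference counting alone never frees it, and only the cyclic garbage collector does. CPython triggers that collector based on the number of allocated container objects, not on how much memory they hold, so a handful of very large objects may not trigger it in time. The `Node` class below is a hypothetical stand-in for such an object.

```python
import gc
import weakref


class Node:
    """Hypothetical stand-in for an object that owns a large buffer."""

    def __init__(self):
        self.payload = bytearray(1024)
        self.other = None


def make_plain():
    # No cycle: the object is freed as soon as the function returns.
    a = Node()
    return weakref.ref(a)


def make_cycle():
    # a <-> b reference cycle: refcounts never drop to zero on their own.
    a, b = Node(), Node()
    a.other, b.other = b, a
    return weakref.ref(a)


gc.disable()  # switch off automatic collection so the demo is deterministic
try:
    plain_ref = make_plain()
    cycle_ref = make_cycle()
    plain_alive_before = plain_ref() is not None  # freed by refcounting
    cycle_alive_before = cycle_ref() is not None  # kept alive by the cycle
    gc.collect()                                  # cyclic collector runs
    cycle_alive_after = cycle_ref() is not None
finally:
    gc.enable()

print(plain_alive_before, cycle_alive_before, cycle_alive_after)
```

If this is what happens with the graph, it would be consistent with `time.sleep(1)` having no effect while `gc.collect()` does.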

As for my original model, I am not yet sure whether invoking the garbage collector actually helps. Memory usage grows sharply, especially when I am about to persist the model to disk.
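To check whether a given change helps, the resident memory of the process could be logged after each iteration. This is a minimal sketch that assumes Linux (it reads `/proc/self/statm`); the helper names `current_rss_bytes` and `log_rss` are made up for illustration.

```python
import os


def current_rss_bytes():
    """Return the resident set size in bytes, or None if /proc is unavailable."""
    try:
        with open("/proc/self/statm") as f:
            # Second field of statm is the number of resident pages.
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def log_rss(label):
    """Print the current resident memory with a label and return it."""
    rss = current_rss_bytes()
    if rss is not None:
        print("{}: {:.1f} MiB".format(label, rss / 2**20))
    return rss


rss_now = log_rss("after iteration")
```

Calling `log_rss` at the end of each loop iteration, with and without `gc.collect()`, would show whether memory is actually being returned.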

Are there any best practices on how to iteratively work with large models? Is this an actual memory issue?
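One workaround I am considering is to run each task in a fresh interpreter process, so that the operating system reclaims all memory when the process exits, regardless of what the garbage collector does. The sketch below uses only the standard library; the `WORKER` script is a hypothetical placeholder where the graph would be built and the session run, and only a small result is passed back as JSON.

```python
import json
import subprocess
import sys

# Hypothetical worker: in a real setup this would build the graph, run the
# session, and print only a small, serializable result.
WORKER = """
import json, sys
task_id = int(sys.argv[1])
data = [0.0] * 1000000  # stands in for the memory-hungry model
print(json.dumps({"task": task_id, "n": len(data)}))
"""


def run_isolated(task_id):
    """Run one task in a separate interpreter and return its JSON result."""
    proc = subprocess.run(
        [sys.executable, "-c", WORKER, str(task_id)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise if the child exits with a non-zero status
    )
    return json.loads(proc.stdout)


results = [run_isolated(i) for i in range(3)]
print(results)
```

The cost is the start-up time of a new process (and a fresh TensorFlow import) per task, which may or may not be acceptable depending on how long each task runs.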

Thanks for any insights.

bouteillebleu
jjs
  • Related: https://stackoverflow.com/questions/63411142/how-to-avoid-oom-errors-in-repeated-training-and-prediction-in-tensorflow (even `gc.collect()` does not always help). – bers Oct 01 '20 at 13:35