
I have a memory leak in my training pipeline and don't know how to fix it.

I am using TensorFlow 1.9.0, tf.keras 2.1.6-tf, and Python 3.5.2.

This is what my training pipeline looks like:

for i in range(num_epochs):
    training_data = training_set.make_one_shot_iterator().get_next()
    hist = model.fit(training_data[0],
                     [training_data[1], training_data[2], training_data[3]],
                     steps_per_epoch=steps_per_epoch_train, epochs=1, verbose=1,
                     callbacks=[history, MemoryCallback()])

    # custom validation

It looks like the memory used by the iterator is not freed after the iterator is exhausted. I have already tried del training_data after model.fit, but it didn't work.

Can anybody give some hints?
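
(For reference, MemoryCallback is a small custom callback used to watch memory growth; a minimal sketch, assuming it simply prints the process's resident memory after each epoch via the resource module, could look like this:)

import resource

import tensorflow as tf

class MemoryCallback(tf.keras.callbacks.Callback):
    """Sketch: print resident memory after every epoch to spot leaks."""
    def on_epoch_end(self, epoch, logs=None):
        # ru_maxrss is reported in kilobytes on Linux
        print('max RSS (kB):', resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)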

Edit: This is how I create the dataset.

dataset = tf.data.TFRecordDataset(tfrecords_filename)
dataset = dataset.map(map_func=preprocess_fn, num_parallel_calls=8)
dataset = dataset.shuffle(100)
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
ninja
  • Are you caching the dataset (training_set)? – O. Gindele Sep 25 '18 at 08:49
  • No, I don't cache the dataset. Would it help? – ninja Sep 25 '18 at 09:13
  • Caching might explain that the dataset is still in memory. How do you build the dataset? – O. Gindele Sep 25 '18 at 09:27
  • I made an edit to my post. – ninja Sep 25 '18 at 09:59
  • Possible duplicate of [Tensorflow runs out of memory while computing: how to find memory leaks?](https://stackoverflow.com/questions/51175837/tensorflow-runs-out-of-memory-while-computing-how-to-find-memory-leaks) – P-Gn Sep 25 '18 at 10:28
  • Thanks. The issue is similar. I think I am constantly adding nodes to the graph with the make_one_shot_iterator operation. But I don't know how to fix it. – ninja Sep 25 '18 at 11:29

1 Answer


Including the repeat() method to reinitialize your iterator might solve your problem. You can take a look at the Input Pipeline Performance Guide to figure out a well-optimized order for your dataset transformations according to your requirements.

dataset = dataset.shuffle(100)
dataset = dataset.repeat() # Can specify num_epochs as input if needed
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
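
Applied to the dataset construction from your edit, the whole pipeline would then look something like this (buffer size and parallelism kept as in your post; the exact ordering is a suggestion, not the only valid one):

dataset = tf.data.TFRecordDataset(tfrecords_filename)
dataset = dataset.map(map_func=preprocess_fn, num_parallel_calls=8)
dataset = dataset.shuffle(100)
dataset = dataset.repeat()  # repeat indefinitely; one iterator now serves all epochs
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)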

If you can afford to do the validation as part of the fit method, you can use something like the code below and drop the loop altogether to make your life easier.

training_data = training_set.make_one_shot_iterator().get_next()
# val_data refers to your validation data and steps_per_epochs_val to the number of validation batches
hist = model.fit(training_data[0],
                 [training_data[1], training_data[2], training_data[3]],
                 validation_data=val_data.make_one_shot_iterator(), validation_steps=steps_per_epochs_val,
                 steps_per_epoch=steps_per_epoch_train, epochs=num_epochs, verbose=1,
                 callbacks=[history, MemoryCallback()])
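
One caveat: if val_data is also built with tf.data, it presumably needs repeat() as well, since validation_steps batches are drawn from it at the end of every epoch.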

Reference: https://github.com/keras-team/keras/blob/master/examples/mnist_dataset_api.py

kvish