7

I'm new to Keras, TensorFlow and Python, and I'm trying to build a model for personal use/future learning. I've just started with Python and came up with this code (with the help of videos and tutorials). My problem is that Python's memory usage slowly creeps up with each epoch, and even after constructing a new model. Once the memory is at 100%, the training just stops with no error/warning. I don't know too much, but the issue should be somewhere within the loop (if I'm not mistaken). I know about

K.clear_session()

but either it didn't remove the issue or I don't know how to integrate it into my code. I have: Python 3.6.4, TensorFlow 2.0.0rc1 (CPU version), Keras 2.3.0.

This is my code:

import pandas as pd
import os
import time
import tensorflow as tf
import numpy as np
import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, LSTM, BatchNormalization
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

EPOCHS = 25
BATCH_SIZE = 32           

df = pd.read_csv("EntryData.csv", names=['1SH5', '1SHA', '1SA5', '1SAA', '1WH5', '1WHA',
                                         '2SA5', '2SAA', '2SH5', '2SHA', '2WA5', '2WAA',
                                         '3R1', '3R2', '3R3', '3R4', '3R5', '3R6',
                                         'Target'])

df_val = 14554 

validation_df = df[df.index > df_val]
df = df[df.index <= df_val]

train_x = df.drop(columns=['Target'])
train_y = df[['Target']]
validation_x = validation_df.drop(columns=['Target'])
validation_y = validation_df[['Target']]

train_x = np.asarray(train_x)
train_y = np.asarray(train_y)
validation_x = np.asarray(validation_x)
validation_y = np.asarray(validation_y)

train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
validation_x = validation_x.reshape(validation_x.shape[0], 1, validation_x.shape[1])

dense_layers = [0, 1, 2]
layer_sizes = [32, 64, 128]
conv_layers = [1, 2, 3]

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            NAME = "{}-conv-{}-nodes-{}-dense-{}".format(conv_layer, layer_size, 
                    dense_layer, int(time.time()))
            tensorboard = TensorBoard(log_dir="logs\{}".format(NAME))
            print(NAME)

            model = Sequential()
            model.add(LSTM(layer_size, input_shape=(train_x.shape[1:]), 
                                       return_sequences=True))
            model.add(Dropout(0.2))
            model.add(BatchNormalization())

            for l in range(conv_layer-1):
                model.add(LSTM(layer_size, return_sequences=True))
                model.add(Dropout(0.1))
                model.add(BatchNormalization())

            for l in range(dense_layer):
                model.add(Dense(layer_size, activation='relu'))
                model.add(Dropout(0.2))

            model.add(Dense(2, activation='softmax'))

            opt = tf.keras.optimizers.Adam(lr=0.001, decay=1e-6)

            # Compile model
            model.compile(loss='sparse_categorical_crossentropy',
                          optimizer=opt,
                          metrics=['accuracy'])

            # unique file name that will include the epoch 
            # and the validation acc for that epoch
            filepath = "RNN_Final.{epoch:02d}-{val_accuracy:.3f}"  
            checkpoint = ModelCheckpoint("models\{}.model".format(filepath, 
                         monitor='val_acc', verbose=0, save_best_only=True, 
                         mode='max')) # saves only the best ones

            # Train model
            history = model.fit(
                train_x, train_y,
                batch_size=BATCH_SIZE,
                epochs=EPOCHS,
                validation_data=(validation_x, validation_y),
                callbacks=[tensorboard, checkpoint])

# Score model
score = model.evaluate(validation_x, validation_y, verbose=2)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# Save model
model.save("models\{}".format(NAME))

Also, I don't know if it's possible to ask 2 problems within 1 question (I don't want to spam this site with problems that anyone with any Python experience can resolve within a minute), but I also have a problem with checkpoint saving. I want to save only the best-performing model (1 model per NN specification - number of nodes/layers), but currently it is saved after every epoch. If this is inappropriate to ask, I can create another question for it.

Thank you very much for any help.

Sly Shark
  • My answer is my best guess at the source of the problem based on the code you've provided - there could be other causes; let me know if the below solves the memory problem – OverLordGoldDragon Sep 27 '19 at 16:18
  • I faced a similar issue while training different models in the same script. I collected some possible fixes and workarounds here: [memory leak with Keras](https://www.thekerneltrip.com/python/keras-memory-leak/) – RUser4512 Oct 01 '21 at 08:33

2 Answers

9

One source of the problem is that a new model = Sequential() in each loop iteration does not remove the previous model; the old model remains built within its TensorFlow graph scope, and every new model = Sequential() adds another lingering construction that eventually overflows memory. To ensure a model is destroyed in full, run the following once you're done with it:

import gc
from tensorflow.keras import backend as K  # the `K` used below

del model
gc.collect()
K.clear_session()
tf.compat.v1.reset_default_graph() # TF graph isn't same as Keras graph

gc is Python's garbage collection module, which clears remnant traces of the model after del. K.clear_session() is the main call; it clears the TensorFlow graph.
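
For instance, here's a rough sketch of where those calls could sit in a search loop like the one in the question. make_model is a hypothetical helper standing in for the model construction (sketched further below), so treat this as placement guidance rather than a drop-in replacement:

import gc
import tensorflow as tf
from tensorflow.keras import backend as K

for dense_layer in dense_layers:
    for layer_size in layer_sizes:
        for conv_layer in conv_layers:
            model = make_model(layer_size, conv_layer, dense_layer)  # hypothetical helper
            model.fit(train_x, train_y,
                      batch_size=BATCH_SIZE, epochs=EPOCHS,
                      validation_data=(validation_x, validation_y))
            model.save("models/{}-{}-{}".format(conv_layer, layer_size, dense_layer))

            # tear the finished model down before the next one is built
            del model
            gc.collect()
            K.clear_session()
            tf.compat.v1.reset_default_graph()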

Also, while your idea for model checkpointing, logging, and hyperparameter search is quite sound, its execution is quite faulty; you will actually be testing only one hyperparameter combination across the entire nested loop you've set up there. But this should be asked in a separate question.
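
As a rough idea of the kind of restructuring meant (and suggested again in the comments below), you could factor the model construction into a helper and iterate over the hyperparameters with it; the make_model() used in the sketch above might look roughly like this, assembled from the question's own layers (illustrative only, not a complete fix):

def make_model(layer_size, conv_layer, dense_layer):
    # build and compile one model for a given hyperparameter combination
    model = Sequential()
    model.add(LSTM(layer_size, input_shape=train_x.shape[1:], return_sequences=True))
    model.add(Dropout(0.2))
    model.add(BatchNormalization())
    for _ in range(conv_layer - 1):
        model.add(LSTM(layer_size, return_sequences=True))
        model.add(Dropout(0.1))
        model.add(BatchNormalization())
    for _ in range(dense_layer):
        model.add(Dense(layer_size, activation='relu'))
        model.add(Dropout(0.2))
    model.add(Dense(2, activation='softmax'))
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(lr=0.001, decay=1e-6),
                  metrics=['accuracy'])
    return model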


UPDATE: I just encountered the same problem on a fully properly set-up environment; the likeliest conclusion is that it's a bug - and a definite culprit is Eager execution. To work around it, use

tf.compat.v1.disable_eager_execution() # right after `import tensorflow as tf`

to switch to Graph mode, which can also run significantly faster. Also see the updated clearing code above.

OverLordGoldDragon
  • Do you know where I should put it in my code? It's either not working or it's telling me that the spacing/tabbing is wrong. – Sly Shark Sep 27 '19 at 16:52
  • @SlyShark I'd actually double-check your tabbing in the posted question, as it's quite off - in particular, everything beneath `#Compile model` should be unindented and outside the for-loops. Then, call the `del model` etc after `model.save(...)` – OverLordGoldDragon Sep 27 '19 at 16:55
  • @SlyShark For any variable you load data into or that otherwise uses `var = ` during the loop, run `print(len(var))` (or `print(len(var['some_key']))` for dictionaries) at the end of each epoch and see if any variable grows unexpectedly. For a step further, apply [pympler](https://stackoverflow.com/questions/5022725/how-do-i-measure-the-memory-usage-of-an-object-in-python) instead -- the idea is to root out any memory leaks, so keep an eye out on all variable loads/assignments – OverLordGoldDragon Sep 28 '19 at 18:45
  • Also, I know that I am currently testing only one hyperparameter combination for the entire model. But, as you can see, currently I can't test more than a few models before I run out of memory. – Sly Shark Sep 28 '19 at 18:47
  • @SlyShark As a tip, debugging becomes substantially easier if you clean up your code - for example, write a `make_model()` function that returns a compiled model and takes hyperparameters as arguments, _then_ put `make_model` in a loop iterating over the hyperparams – OverLordGoldDragon Sep 28 '19 at 18:48
  • @SlyShark I just encountered the same problem; I have a fully proper environment setup - thus, the likeliest conclusion is, it's a bug. See updated answer for a workaround (that worked for me) – OverLordGoldDragon Nov 14 '19 at 02:43
-3

This is a known bug. Updating to TensorFlow 2.1 should fix the issue.
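
For example, assuming a pip-managed CPU install:

pip install --upgrade tensorflow==2.1.0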

Baguette