Keras occupies an indefinitely increasing amount of memory for each epoch

Question

I'm running a genetic hyperparameter search algorithm and it quickly saturates all available memory.

After a few tests, it looks like the memory required by Keras increases both between epochs and when training different models. The problem gets much worse as the minibatch size increases. A minibatch size of 1~5 at least gives me enough time to see memory usage rise very quickly during the first few fits and then keep increasing slowly but steadily over time.

I already checked "keras predict memory swap increase indefinitely", "Keras: Out of memory when doing hyper parameter grid search", and "Keras (TensorFlow, CPU): Training Sequential models in loop eats memory", so I am already clearing the Keras session and resetting TensorFlow's graph after each iteration.

I also tried explicitly deleting the model and history object and running gc.collect() but to no avail.

I'm running Keras 2.2.4, TensorFlow 1.12.0, and Python 3.7.0 on CPU. Below is the code I run for each gene, along with the callback I use to measure memory usage:

import gc
import resource

import tensorflow as tf
import keras as K


class MemoryCallback(K.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # ru_maxrss is the peak resident set size (kilobytes on Linux)
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)


def Rateme(self, loss, classnum, patience, epochs, DWIshape, Mapshape, lr,
           TRAINDATA, TESTDATA, TrueTrain, TrueTest, ModelBuilder, maxthreads):
    # Limit the number of threads TensorFlow may use
    K.backend.set_session(tf.Session(config=tf.ConfigProto(
        intra_op_parallelism_threads=maxthreads,
        inter_op_parallelism_threads=maxthreads)))

    # Early stopping
    STOP = K.callbacks.EarlyStopping(monitor='val_acc', min_delta=0.001,
                                     patience=patience, verbose=0, mode='max')
    # Build model
    Model = ModelBuilder(DWIshape, Mapshape, dropout=self.Dropout,
                         regularization=self.Regularization,
                         activ='relu', DWIconv=self.nDWI, DWIsize=self.sDWI,
                         classes=classnum, layers=self.nCNN,
                         filtersize=self.sCNN,
                         FClayers=self.FCL, last=self.Last)
    # Compile
    Model.compile(optimizer=K.optimizers.Adam(lr, decay=self.Decay),
                  loss=loss, metrics=['accuracy'])
    # Fit
    his = Model.fit(x=TRAINDATA, y=TrueTrain, epochs=epochs, batch_size=5,
                    shuffle=True, validation_data=(TESTDATA, TrueTest),
                    verbose=0, callbacks=[STOP, MemoryCallback()])
    # Extract the accuracy on the test set
    S = Model.evaluate(x=TESTDATA, y=TrueTest, verbose=1)[1]

    # Attempt to release everything before the next gene is evaluated
    del his
    del Model
    K.backend.clear_session()
    tf.reset_default_graph()
    gc.collect()

    return S
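
A note on the measurement itself: ru_maxrss reports the peak resident set size, so the printed value can only stay flat or grow, even if memory is later freed. A minimal sketch of reading the current resident set size instead is shown here. It assumes a Linux-style /proc filesystem, and the helper names current_rss_bytes and peak_rss_raw are made up for illustration, not part of the original code.

```python
import os
import resource


def current_rss_bytes():
    """Return the current resident set size in bytes, or None if unavailable.

    Assumes a Linux-style /proc filesystem; elsewhere it returns None.
    """
    try:
        with open("/proc/self/statm") as f:
            # Second field of statm is the number of resident pages
            resident_pages = int(f.read().split()[1])
    except (OSError, IndexError, ValueError):
        return None
    return resident_pages * os.sysconf("SC_PAGE_SIZE")


def peak_rss_raw():
    """Peak RSS as reported by getrusage (kilobytes on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


if __name__ == "__main__":
    print("current RSS (bytes):", current_rss_bytes())
    print("peak RSS (raw ru_maxrss):", peak_rss_raw())
```

Calling current_rss_bytes() from on_epoch_end instead of reading ru_maxrss would show whether memory actually drops after clearing the session, rather than only the high-water mark.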

paweller · Accepted Answer · 2021-06-16T06:56:29.203

Since the memory leak still seems to be present in TensorFlow 2.4.1 when using built-in functions like model.fit(), here is my take on it.

Issues

High RAM usage, even though I am training on NVIDIA GeForce RTX 2080 Ti GPUs.
Epoch times that increase as training progresses.
Some kind of memory leak (growth looked roughly linear).

Solutions

Add the run_eagerly=True argument to the model.compile() function. However, doing so may disable TensorFlow's graph optimizations, which could reduce performance (reference).
Create a custom callback that runs garbage collection and clears the Keras backend at the end of each epoch (reference).
Do not use the activation parameter inside the tf.keras.layers. Put the activation function in a separate layer instead (reference).
Use LeakyReLU instead of ReLU as the activation function (reference).

Note: Since each of the points above can be implemented independently, you can mix and match them until you get a result that works for you. Here is a code snippet showing all of the solutions together:

import gc
from tensorflow.keras import backend as k
from tensorflow.keras.layers import Conv2D, BatchNormalization, ReLU
from tensorflow.keras.callbacks import Callback


class CovNet:
    ...
    x = Conv2D(
        ...,
        activation=None  # the activation is applied as a separate layer below
    )(x)
    x = BatchNormalization()(x)
    x = ReLU()(x)  # or LeakyReLU
    ...

#--------------------------------------------------------------------------------

class ClearMemory(Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Collect garbage and clear the Keras backend after every epoch
        gc.collect()
        k.clear_session()

#--------------------------------------------------------------------------------

model.compile(
    ...,
    run_eagerly=True
)

#--------------------------------------------------------------------------------

model.fit(
    ...,
    callbacks=[ClearMemory()]
)

With these changes, training uses less RAM, epoch times stay constant, and any remaining memory leak is negligible.

Thanks to @Hongtao Yang for providing the link to one of the related GitHub issues and to rschiewer over at GitHub for his comment.

Notes

If none of the above works for you, you might want to try writing your own training loop in TensorFlow. Here is a guide on how to do it.
People have also reported that using tcmalloc instead of the default malloc allocator alleviated the memory leak to some degree. For references see here or here.
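
As a rough illustration of the custom training loop mentioned in the first note, here is a minimal sketch using tf.GradientTape in TensorFlow 2. The toy data, the small two-layer model, and the names train_step and epoch_losses are made up for the example and are not taken from the linked guide. Whether this avoids the leak in a given setup is something to verify, not a guarantee.

```python
import numpy as np
import tensorflow as tf

# Toy data and model, purely illustrative (not from the original post)
x = np.random.rand(64, 8).astype("float32")
y = np.random.randint(0, 2, size=(64,)).astype("int32")
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),  # raw logits, softmax is folded into the loss
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam(1e-3)


@tf.function
def train_step(batch_x, batch_y):
    """Run one forward/backward pass and apply the gradients."""
    with tf.GradientTape() as tape:
        logits = model(batch_x, training=True)
        loss = loss_fn(batch_y, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


epoch_losses = []
for epoch in range(3):
    # Iterate over the batches manually instead of calling model.fit()
    batch_losses = [float(train_step(bx, by)) for bx, by in dataset]
    epoch_losses.append(sum(batch_losses) / len(batch_losses))
    print(f"epoch {epoch}: mean loss = {epoch_losses[-1]:.4f}")
```

Validation, early stopping, and callbacks all have to be written by hand in such a loop, which is the price for bypassing the built-in fit() machinery.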

I hope this helps others too and saves you some frustrating hours of research on the internet.

Reza Behzadpour · Answer 2 · 2018-12-08T17:18:18.157

By default, TensorFlow reserves nearly all of the available GPU memory.
You can restrict the amount of memory TensorFlow uses with the following code:

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9 # fraction of memory
config.gpu_options.visible_device_list = "0"

set_session(tf.Session(config=config))
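
The snippet above uses the TensorFlow 1.x session API. In TensorFlow 2, a related (but not identical) option is to enable memory growth, so that GPU memory is allocated on demand instead of being reserved up front. This is a minimal sketch, assuming TensorFlow 2 and that it runs before any GPU has been initialized. It swaps the fixed memory fraction for on-demand growth, and it only affects GPU memory, not host RAM.

```python
import tensorflow as tf

# List the GPUs TensorFlow can see (an empty list on CPU-only machines)
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    # Must be called before the GPU is first used, otherwise TF raises an error
    tf.config.experimental.set_memory_growth(gpu, True)

print(f"memory growth enabled on {len(gpus)} GPU(s)")
```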

edited Dec 08 '18 at 17:18

answered Dec 08 '18 at 17:13

Reza Behzadpour

This doesn't seem to work, probably because, as I said, I'm working on CPU. It just keeps eating up all of my RAM and then moves on to swap space. – Hierakonpolis Dec 10 '18 at 15:39

score 2 · Answer 3 · answered Jan 22 '19 at 13:37

In the end, I just restarted the Python session between training sessions with a bash script. I couldn't find a better way to avoid an exploding memory footprint.
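
The same idea can be sketched from within Python by running each training session in a fresh interpreter, so that all of its memory is returned to the operating system when that process exits. This is a minimal sketch, not the bash script used here: WORKER_SOURCE, rate_in_subprocess, and the placeholder score are hypothetical, and the real worker would build, fit, and evaluate the Keras model where the comment indicates.

```python
import json
import subprocess
import sys

# Hypothetical worker: in practice this would build, fit and evaluate a model.
# Here it only computes a dummy score so the sketch runs without Keras.
WORKER_SOURCE = """
import json, sys
params = json.loads(sys.argv[1])
# ... build the model from params, call fit() and evaluate() here ...
score = 1.0 / (1.0 + params["lr"])  # placeholder, NOT a real metric
print(json.dumps({"score": score}))
"""


def rate_in_subprocess(params):
    """Run one evaluation in a fresh Python process and return its score."""
    result = subprocess.run(
        [sys.executable, "-c", WORKER_SOURCE, json.dumps(params)],
        capture_output=True, text=True, check=True,
    )
    # The worker prints its result as JSON on the last line of stdout
    return json.loads(result.stdout.strip().splitlines()[-1])["score"]


if __name__ == "__main__":
    for lr in (0.1, 0.01):
        print(lr, rate_in_subprocess({"lr": lr}))
```

If later runs need to continue from earlier weights, the worker could save them to disk and the next process could load them, at the cost of some extra I/O per run.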

Hierakonpolis

This possibly won't work if I'm training a model and the weights need to get updated every iteration based on previous weights? – momo Jul 29 '20 at 02:45

score 0 · Answer 4 · answered Aug 19 '19 at 03:07

Maybe this is a related issue? If so, you should be fine using a custom training loop instead of the model.fit method.

I don't think they have addressed this issue yet, so I would avoid using built-in training/evaluation/prediction methods.

Hongtao Yang
