
This is the first time I'm experiencing this issue. I've been using this model for a while, but with less data. The problem is that the first 3 epochs took 11 sec/step (31k samples, batch size 128), while the 4th epoch took 18 sec/step and the fifth took about 45 sec/step. I'm using Keras and not doing any custom training loop shenanigans.

Can someone explain this slowdown? The training hasn't been interrupted. I'm using TF 2.3.

Epoch 1/1200
248/248 [==============================] - 2727s 11s/step - loss: 2.3481 - acc: 0.3818 - top3_acc: 0.5751 - recall: 0.2228 - precision: 0.6195 - f1: 0.3239 - val_loss: 0.9020 - val_acc: 0.8085 - val_top3_acc: 0.8956 - val_recall: 0.5677 - val_precision: 0.9793 - val_f1: 0.7179
Epoch 2/1200
248/248 [==============================] - 2712s 11s/step - loss: 1.0319 - acc: 0.7203 - top3_acc: 0.8615 - recall: 0.5489 - precision: 0.9245 - f1: 0.6865 - val_loss: 0.5547 - val_acc: 0.8708 - val_top3_acc: 0.9371 - val_recall: 0.7491 - val_precision: 0.9661 - val_f1: 0.8435
Epoch 3/1200
248/248 [==============================] - 4426s 18s/step - loss: 0.7094 - acc: 0.8093 - top3_acc: 0.9178 - recall: 0.6830 - precision: 0.9446 - f1: 0.7920 - val_loss: 0.4399 - val_acc: 0.8881 - val_top3_acc: 0.9567 - val_recall: 0.8140 - val_precision: 0.9606 - val_f1: 0.8808
Epoch 4/1200
 18/248 [=>............................] - ETA: 3:14:16 - loss: 0.6452 - acc: 0.8338 - top3_acc: 0.9223 - recall: 0.7257 - precision: 0.9536 - f1: 0.8240

Edit: I just ran this on a very small sample (20 items per category) of the data and the step time does not increase. proof

Edit 2: Model summary

Model: "functional_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_token (InputLayer)        [(None, 300)]        0                                            
__________________________________________________________________________________________________
masked_token (InputLayer)       multiple             0           input_token[0][0]                
__________________________________________________________________________________________________
tf_distil_bert_model (TFDistilB ((None, 300, 768),)  66362880    masked_token[1][0]               
__________________________________________________________________________________________________
tf_op_layer_strided_slice (Tens multiple             0           tf_distil_bert_model[1][0]       
__________________________________________________________________________________________________
efficientnetb5_input (InputLaye [(None, 456, 456, 3) 0                                            
__________________________________________________________________________________________________
batch_normalization (BatchNorma (None, 768)          3072        tf_op_layer_strided_slice[1][0]  
__________________________________________________________________________________________________
efficientnetb5 (Functional)     (None, 15, 15, 2048) 28513527    efficientnetb5_input[0][0]       
__________________________________________________________________________________________________
dense (Dense)                   (None, 256)          196864      batch_normalization[1][0]        
__________________________________________________________________________________________________
global_average_pooling2d (Globa (None, 2048)         0           efficientnetb5[1][0]             
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 140)          35980       dense[1][0]                      
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 140)          286860      global_average_pooling2d[1][0]   
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 280)          0           dense_1[1][0]                    
                                                                 dense_3[1][0]                    
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 100)          28100       concatenate[0][0]                
__________________________________________________________________________________________________
dropout_20 (Dropout)            (None, 100)          0           dense_4[0][0]                    
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 20)           2020        dropout_20[0][0]                 
==================================================================================================
Total params: 95,429,303
Trainable params: 30,120
Non-trainable params: 95,399,183
  • Are you using callbacks? The network's kernel matrix propagation during training would always be identical, so it has to be some process you've injected. – Ryan Cocuzzo Aug 24 '20 at 23:55
  • @RyanCocuzzo I have the early stopping callback only. Do you think that could cause this? – Christo S. Christov Aug 24 '20 at 23:58
  • Are you doing reinforcement learning? If your updates are based on doing a certain number of episodes, then at the beginning of training, when episodes tend to be short, the updates can be fast. – Taw Aug 25 '20 at 00:04
  • @Taw No, this is multi modal learning. I'm merging the outputs of two pretrained models and learning from that. These two models are frozen during training and only a single dense layer is trained. First model - CNN, second model - RNN – Christo S. Christov Aug 25 '20 at 00:05
  • I believe you have some sort of errored callback because, between constant-process loops, time is increasing. – Ryan Cocuzzo Aug 25 '20 at 00:07
  • @RyanCocuzzo That's odd because the only one I'm using is EarlyStopping from the keras framework and I'm not seeing any errors on screen – Christo S. Christov Aug 25 '20 at 00:08
  • Try training without it, but it could only otherwise be some deeply-technical edge cases during training. I could toss out ideas, but they would be guesses at best and very case-specific. – Ryan Cocuzzo Aug 25 '20 at 00:10
  • @RyanCocuzzo Interestingly enough training on a small sample of the data works quite well without any slowdown (no slowdown is seen for 50 epochs now). I think it has something to do with the amount of items being inputted. Anyway, I will try removing the early stopping as well but as you might guess, this will take a while :) – Christo S. Christov Aug 25 '20 at 00:12
  • Yea, if it's not the callback and we're assuming all the keras code works without error, it would be some highly technical issue with maybe an exploding gradient increasing computation time exponentially, or "dead" neurons revitalizing throughout the training at an exponential rate, etc. These are unlikely and I hope they're not relevant. – Ryan Cocuzzo Aug 25 '20 at 00:18
  • could you provide the model summary as part of the question? – Akshay Sehgal Aug 25 '20 at 00:24
  • I bring up "dead" neurons because it's possible in the early stages of training, your network propagation is not reaching the whole network, depending on your data & activation functions. – Ryan Cocuzzo Aug 25 '20 at 00:24
  • Also, if possible can you post the code for the data loader? – Akshay Sehgal Aug 25 '20 at 00:27
  • @AkshaySehgal Added the model summary. I don't really have a data loader. I just add all the data into memory and pass it as np arrays to keras. – Christo S. Christov Aug 25 '20 at 00:29
  • @RyanCocuzzo hmmm i have no idea how to check that, my guess is something is accumulating after each epoch that shouldn't be accumulating and it causes exponential slowdown – Christo S. Christov Aug 25 '20 at 00:33
  • I think you are running out of memory due to a leak. This is causing some issues. Here is a sample of that - https://datascience.stackexchange.com/questions/61224/running-out-of-memory-when-training-keras-lstm-model-for-binary-classification-o Notice in the post, the epoch times progressively increase for OP's code just like yours. 16s, 18s, 21s ... – Akshay Sehgal Aug 25 '20 at 00:36

1 Answer

Symptoms:

This looks like a memory issue caused by a leak. You can run the model with constant epoch times on a small sample, but with the complete data the epoch times increase progressively (and the time/step increases too). My assumption is that as you run out of memory, the limited resources drive the epoch times up. From some web searching, it seems that others who have had memory leaks in Keras saw similar 'symptoms' with respect to epoch times.

Check this post, for example:

POST TITLE - "Running out of memory when training Keras LSTM model for binary classification on image sequences"

Using TensorFlow backend.
Epoch 1/60
1/1 [==============================] - 16s 16s/step - loss: 0.7258 - acc: 0.5400 - val_loss: 0.7119 - val_acc: 0.6200
Epoch 2/60
1/1 [==============================] - 18s 18s/step - loss: 0.7301 - acc: 0.4800 - val_loss: 0.7445 - val_acc: 0.4000
Epoch 3/60
1/1 [==============================] - 21s 21s/step - loss: 0.7312 - acc: 0.4200 - val_loss: 0.7411 - val_acc: 0.4200
(...training continues...)

Notice the progressively increasing epoch times?


Diagnosis:

One way to check your memory usage (without using TensorBoard) is with a callback. Here is some dummy code I wrote that prints the memory usage after each epoch using a callback.

import numpy as np
import tensorflow.keras as keras
import resource  # Unix-only; exposes resource usage of the current process

class MemoryCallback(keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # ru_maxrss is the peak resident set size so far
        # (kilobytes on Linux, bytes on macOS)
        print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

def build_model(shape):
    f_input = keras.layers.Input(shape=(shape[1],))  # (100,)
    d1 = keras.layers.Dense(50, activation='tanh')(f_input)
    d1 = keras.layers.Dense(50, activation='tanh')(d1)
    softmax = keras.layers.Dense(10, activation='softmax')(d1)
    return keras.Model(f_input, softmax)

data = np.random.random((1000, 100))
model = build_model(data.shape)
model.compile(loss='mse', optimizer='SGD')
# dummy targets matching the 10-way softmax output
model.fit(x=data, y=np.random.random((1000, 10)), verbose=0, epochs=10,
          callbacks=[MemoryCallback()])
312844288
312897536
312909824
312918016
312918016
312926208
312930304
312934400
312938496
312950784

Make sure to set verbose=0 here so only the memory numbers are printed, and restart your kernel / Python IDE before running it so that memory is cleared before checking. In your case I expect this number to increase progressively epoch after epoch.
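
The resource module is Unix-only, so if you happen to be on Windows you could write a roughly equivalent callback with psutil. This is just a sketch of the same idea and assumes psutil is installed:

import psutil
import tensorflow.keras as keras

class PsutilMemoryCallback(keras.callbacks.Callback):
    # Prints the current resident set size (RSS) of this process after
    # each epoch; psutil also works on Windows, unlike resource.
    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
        print(f'epoch {epoch}: {rss_mb:.1f} MiB resident')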

If you are on macOS or Linux, you can also use htop to watch your memory usage while the model is running; you should see the memory climb to its cap. You can install htop with brew install htop on macOS or sudo apt-get install htop on Linux.

[screenshot: htop showing memory usage]


Solution:

A solution to this problem is suggested in this Stack Overflow post: limit the memory that TensorFlow is allowed to use.

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session

# Note: this is the TF 1.x / standalone-Keras API from the linked post;
# a TF 2.x sketch follows below.
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # fraction of GPU memory to use
config.gpu_options.visible_device_list = "0"

set_session(tf.Session(config=config))
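
Since you mentioned TF 2.3, where tf.ConfigProto and tf.Session are no longer top-level APIs, the rough equivalent goes through tf.config. The snippet below is only a sketch (the 4096 MB cap is an arbitrary example value) and must run before the GPU is first used:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Option 1: let the allocator grow on demand instead of grabbing
    # all GPU memory up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (use instead of option 1): hard-cap the GPU memory
    # TensorFlow may use, e.g. 4096 MB on the first GPU.
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])
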
  • Hi, I will try out the profiling tomorrow first thing in the morning :) thanks for answering. Also I took a note of the link to the question asked and I will review it – Christo S. Christov Aug 25 '20 at 00:43