
I have two Keras models - a GRU and an LSTM - which I run in a Jupyter Notebook. Both have the same implementation except, of course, for the recurrent layer they use (GRU vs. LSTM). Here is my code:


# Imports shared by both models
from keras.models import Sequential
from keras.layers import CuDNNGRU, CuDNNLSTM, Dense
from keras.callbacks import EarlyStopping

# 1st Model - GRU

if run_gru:

    model_gru = Sequential()

    model_gru.add(CuDNNGRU(75, return_sequences=True, input_shape=(i1, i2)))
    model_gru.add(CuDNNGRU(units=30, return_sequences=True))
    model_gru.add(CuDNNGRU(units=30))
    model_gru.add(Dense(units=1, activation="sigmoid"))
    model_gru.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    history_gru = model_gru.fit(x, y, epochs=200, batch_size=64, validation_data=(x2, y2), shuffle=False, callbacks=[EarlyStopping(patience=100, restore_best_weights=True)])


# 2nd Model - LSTM

if run_lstm:

    model_lstm = Sequential()

    model_lstm.add(CuDNNLSTM(75, return_sequences=True, input_shape=(i1, i2)))
    model_lstm.add(CuDNNLSTM(units=30, return_sequences=True))
    model_lstm.add(CuDNNLSTM(units=30))
    model_lstm.add(Dense(units=1, activation="sigmoid"))
    model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    history_lstm = model_lstm.fit(x, y, epochs=200, batch_size=64, validation_data=(x2, y2), shuffle=False, callbacks=[EarlyStopping(patience=100, restore_best_weights=True)])

Here are my results when I run each model separately (i.e. restarting the kernel after each run):

  • run_gru = True; run_lstm = False -> GRU's val_acc = 58.13953%
  • run_gru = False; run_lstm = True -> LSTM's val_acc = 51.16279%

However, if I run LSTM immediately after GRU during the same kernel run (i.e. run both without restarting), my results are now as follows:

  • run_gru = True; run_lstm = True -> GRU's val_acc = 58.13953% (same as before) but LSTM's val_acc = 79.06977% (way better)

I am wondering if anyone has a guess as to why the 2nd model (LSTM) now achieves much better accuracy, even though the two are separate models.

I suspected that the 2nd model was stealing results from the 1st, so I checked the loss for both models at epoch 1. For each model, the epoch-1 loss is the same as in the standalone run, which implies that the LSTM isn't stealing the weights/results from the 1st model (GRU). Also, the summary of the 2nd model indicates that it is a brand-new model (i.e. it starts from layer 1, not layer 4). I tried setting restore_best_weights to False, but it still results in the huge difference for the 2nd model.

I understand that I can run each model separately, but I would like to run them together so I can perform further analysis after both are trained. I could also leave things as they are, run the LSTM immediately after the GRU, and just use the LSTM to predict results, but it seems like I might be missing something really obvious that leads to these different results. My thanks in advance!

user5305519
  • When you train a model on GPU, each time weights are initialized randomly, so the result is not perfectly reproducible, how long did you train? – Zabir Al Nazi May 11 '20 at 22:09
  • I set a random seed so that the results are consistent no matter how many times I run it. As for training duration, it takes about 5-15mins each. – user5305519 May 11 '20 at 22:20
  • Can you reproduce the same behaviour with dummy data and add a reproducible script? If yes, it's worth looking otherwise this just maybe a random training issue. – Zabir Al Nazi May 11 '20 at 22:31
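
For reference, the seeding mentioned in the comments might look like this - a sketch assuming the TF 1.x-era stack that the CuDNN layers imply, with an arbitrary seed value:

import random
import numpy as np
import tensorflow as tf

# Seed every RNG source so weight initialization is repeatable
random.seed(42)
np.random.seed(42)
tf.set_random_seed(42)  # use tf.random.set_seed(42) on TF 2.x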

1 Answer


After doing some research, it turns out that even if you define 2 different models in the same kernel run, state from the 1st model lingers and its layers effectively get added to the 2nd. That is why the loss at epoch 1 is always the same when I run the 2nd model, but the eventual result differs: the 2nd model = layers from the 1st model + layers from the 2nd model.

As explained here, I should use K.clear_session() to remove the old layers/nodes so that each model starts fresh.
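
A minimal sketch of that fix, assuming the same data and layer definitions as in the question (the build/compile/fit bodies are elided for brevity):

from keras import backend as K

# Reset the backend's global state before building the 1st model
K.clear_session()

model_gru = Sequential()
# ... add GRU layers, compile, and fit as above ...

# Reset again so the 2nd model does not inherit anything from the 1st
K.clear_session()

model_lstm = Sequential()
# ... add LSTM layers, compile, and fit as above ...

K.clear_session() resets the backend's global state (the default graph in TF 1.x), which is what allowed the two models to interact in the first place.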

user5305519