5

I am training an LSTM autoencoder, but the loss function randomly shoots up, as in the picture below (screenshot of the explosion in the loss function). I tried multiple things to prevent this, such as adjusting the batch size and the number of neurons in my layers, but nothing seems to help. I checked my input data for null/infinity values, but it contains none, and it is also normalized. Here is my code for reference:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, RepeatVector, TimeDistributed, Dense
from sklearn.model_selection import train_test_split

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(430, 3)))  # sequences: 430 timesteps, 3 features; zeros are masked
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, activation='relu'))  # encoder
model.add(RepeatVector(430))  # repeat the encoding for each output timestep
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, activation='relu', return_sequences=True))  # decoder
model.add(TimeDistributed(Dense(3)))  # reconstruct the 3 features per timestep
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])

context_paths = loadFile()
X_train, X_test = train_test_split(context_paths, test_size=0.20)

history = model.fit(X_train, X_train, epochs=1, batch_size=4, verbose=1, validation_data=(X_test, X_test))

The loss function explodes at random points in time, sometimes sooner, sometimes later. I read this thread about possible problems, but at this point, after trying multiple things, I am not sure what to do to prevent the loss function from skyrocketing at random. Any advice is appreciated. Beyond this, I can see that my accuracy is not increasing very much, so the problems may be interconnected.

Michael Kročka
  • Same issue today! I have no idea why! I am building an LSTM autoencoder with Adam as the base optimizer. – Avv Jul 09 '21 at 03:04

2 Answers

8

Two main points:

1st point: As highlighted by Daniel Möller, don't use 'relu' for LSTM; leave the standard activation, which is 'tanh'.

2nd point: One way to fix the exploding gradient is to use clipnorm or clipvalue for the optimizer.

Try something like this for the last two lines.

For clipnorm:

opt = tf.keras.optimizers.Adam(clipnorm=1.0)

For clipvalue:

opt = tf.keras.optimizers.Adam(clipvalue=0.5)
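Either optimizer object then replaces the 'adam' string when compiling. A minimal sketch of how this might be wired into the question's code, assuming TensorFlow 2.x (the accuracy metric is dropped here, since it isn't meaningful for an MSE loss):

import tensorflow as tf

opt = tf.keras.optimizers.Adam(clipnorm=1.0)  # or tf.keras.optimizers.Adam(clipvalue=0.5)
model.compile(optimizer=opt, loss='mean_squared_error')
history = model.fit(X_train, X_train, epochs=1, batch_size=4, verbose=1, validation_data=(X_test, X_test))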

See this post for help (previous version of TF): How to apply gradient clipping in TensorFlow?

And this post for general explanation: https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/

Dr. H. Lecter
  • Thank you for your suggestion; the explosion in the loss function was solved by removing the ReLUs, so I haven't experimented with clipnorm and clipvalue. – Michael Kročka Mar 25 '20 at 09:25
  • Then you are good, no need to play with those! They are safety-net parameters anyway; it might be useful to include them if you plan to train on several different datasets. – Dr. H. Lecter Mar 25 '20 at 10:38
  • @Dr.H.Lecter Thank you very much, Doctor! It works; I was getting inf for the first epoch and then nan for later epochs. I replaced relu with tanh and also used clipnorm, which works fine now, but I still get a high loss anyway: Epoch 1/10 1/1 - 8s - loss: 91188.7188 Epoch 2/10 1/1 - 0s - loss: 91179.7031 Epoch 3/10 1/1 - 0s - loss: 91169.9688 Epoch 4/10 1/1 - 0s - loss: 91157.8672 Any idea why that happened, please? By the way, my original data has a lot of 0s and 1s as well as a mix of positive and negative values. I did normalize my data as well. – Avv Jul 09 '21 at 03:09
  • I deleted the 0s and 1s and my loss is now 0.9! However, those deleted values are important, as they represent electrical substations switching off and on. Is this a good idea, please? – Avv Jul 09 '21 at 03:40
5

Two main issues:

  • Don't use 'relu' for LSTM; leave the standard activation, which is 'tanh'. Because LSTMs are "recurrent", it's very easy for them to accumulate growing or shrinking values to the point of making the numbers useless (see the sketch after this list).
  • Check the range of your data in X_train and X_test. Make sure they're not too big. Something between -4 and +4 is sort of good. You should consider normalizing your data if it's not normalized yet.
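
A minimal sketch of both suggestions applied to the question's model (just a sketch: it assumes the same Keras imports as the question and that X_train / X_test are NumPy arrays):

model = Sequential()
model.add(Masking(mask_value=0, input_shape=(430, 3)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))  # default 'tanh' activation instead of 'relu'
model.add(RepeatVector(430))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))  # same here
model.add(TimeDistributed(Dense(3)))
model.compile(optimizer='adam', loss='mean_squared_error')

print(X_train.min(), X_train.max())  # sanity check: roughly within [-4, 4] is fine
print(X_test.min(), X_test.max())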

Notice that "accuracy" doesn't make any sense for problems that are not classificatino. (I notice your final activation is "linear", so you're not doing classification, right?)


Finally, if the two hints above don't work, check whether you have an example that is all zeros; this might be creating a "full mask" sequence, and this "might" (I don't know) cause a bug.

(X_train == 0).all(axis=(1, 2)).any()  # should be False
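
If that check returns True, one possible workaround (just a sketch, assuming X_train is a NumPy array of shape (samples, 430, 3)) is to filter out the all-zero sequences before training:

import numpy as np

keep = ~np.all(X_train == 0, axis=(1, 2))  # True for sequences containing at least one non-zero timestep
X_train = X_train[keep]
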
Daniel Möller
  • Right now I'm just trying to use an autoencoder to learn a representation of my data. Later, when I reach a good enough accuracy breakpoint, I want to use the encoder part in conjunction with a custom clustering layer, so that my data is (hopefully) divided into clear clusters. So in a sense it is classification without predesigned classes. My final activation is linear because my data is in the range [-2, 2] and I haven't found any activation function with such a range; tanh is only [-1, 1]. – Michael Kročka Mar 25 '20 at 09:23
  • Ok, -2 to 2 is reasonable, but "relu" in LSTM is really troublesome. Don't use it, leave the default. --- If the initial predictions of your model are too far from this range, you might like to have a BatchNormalization (not really necessary) before or after the last Dense. – Daniel Möller Mar 25 '20 at 11:54
  • Thank you very much. I did what you mentioned about tanh and clipvalue, but I still have a high loss. Probably because half of each row consists of 0s and 1s? But removing them might produce wrong results? Please correct me if I am wrong. Epoch 1/10 1/1 - 9s - loss: 91187.0781 Epoch 2/10 1/1 - 0s - loss: 91178.6875 Epoch 3/10 1/1 - 0s - loss: 91168.4688 – Avv Jul 09 '21 at 03:23
  • This is what I got for the first 3 epochs after I replaced relu with tanh (high loss!): Epoch 1/10 1/1 - 9s - loss: 91189.1953 Epoch 2/10 1/1 - 0s - loss: 91176.1953 Epoch 3/10 1/1 - 0s - loss: 91164.1172 ... When I deleted the 0s and 1s from each row, the loss got better, around 0.9. But deleting those values is not a good idea, since they represent switches turning off and on. Any idea about that, please? – Avv Jul 09 '21 at 03:33
  • @DanielMöller Thank you; this helped! --- I was getting NaNs for predicting one output from input of 7 variables in a 3 year weather dataset using this: model = Sequential() model.add(LSTM(64, activation='relu', input_shape=(trainX.shape[1], trainX.shape[2]), return_sequences=True)) model.add(LSTM(32, activation='relu', return_sequences=False)) model.add(Dropout(0.1)) model.add(Dense(14, activation='relu')) model.compile(optimizer= Adam(learning_rate = 0.0001), loss="mean_squared_error")......Any other improvements advisable?? – SamV Jul 15 '21 at 21:33