I'm trying to train an LSTM autoencoder. It's like a seq2seq model: you feed a signal in and get a reconstructed signal sequence back, and the sequence I'm using should be quite easy to learn. The loss function and metric are both MSE. The first hundred epochs went well, but after some more epochs the MSE suddenly became extremely high and sometimes goes to NaN. I don't know what causes this. Can you inspect the code and give me a hint? The sequence is normalized beforehand, so it's in the [0, 1] range; how can it produce such a high MSE error? This is the input sequence I take from the training set:

sequence1 = x_train[0][:128]

It looks like this: [plot of the 128-sample signal omitted]

I get the data from a public signal dataset (128×1). This is the code (modified from the Keras blog):

# lstm autoencoder recreate sequence
from numpy import array
from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
from keras.utils import plot_model
from keras import regularizers

# define input sequence. sequence1 is a one-dimensional array
# reshape sequence1 input into [samples, timesteps, features]
n_in = len(sequence1)
sequence = sequence1.reshape((1, n_in, 1))
# define model
model = Sequential()
model.add(LSTM(1024, activation='relu', input_shape=(n_in,1)))
model.add(RepeatVector(n_in))
model.add(LSTM(1024, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(optimizer='adam', loss='mse')
for epo in [50, 100, 1000, 2000]:
    model.fit(sequence, sequence, epochs=epo)

The first few epochs went well; the losses were all around 0.003x or so. Then the loss suddenly jumped to a very big number and eventually went to NaN and stayed there.

Yu Huang
2 Answers


You might have a problem with exploding gradients during backpropagation. Try the clipnorm and clipvalue parameters of the optimizer to apply gradient clipping: https://keras.io/optimizers/

Alternatively, what learning rate are you using? I would also try reducing it by a factor of 10, 100, or 1000 to check whether you observe the same behavior.
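
For concreteness, a rough sketch of both suggestions combined (the values are illustrative starting points, not tuned; older Keras versions spell the learning-rate argument lr instead of learning_rate):

from keras.optimizers import Adam

# lower learning rate (Adam's default is 1e-3) plus gradient clipping
opt = Adam(learning_rate=1e-4,  # try 1/10, 1/100, 1/1000 of the default
           clipnorm=1.0)        # rescale each gradient tensor whose L2 norm exceeds 1.0
# clipvalue=0.5 would instead clip every gradient element to [-0.5, 0.5]
model.compile(optimizer=opt, loss='mse')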

Dr. H. Lecter

'relu' is the main culprit - see here. Possible solutions:

  1. Initialize weights to smaller values, e.g. keras.initializers.TruncatedNormal(mean=0.0, stddev=0.01) (a combined sketch follows this list)
  2. Clip weights (at initialization, or via kernel_constraint, recurrent_constraint, ...)
  3. Increase weight decay
  4. Use a warmup learning rate scheme (start low, gradually increase)
  5. Use 'selu' activation, which is more stable, is ReLU-like, and works better than ReLU on some tasks
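
A rough sketch of options 1, 2 and 5 applied to the encoder LSTM (my own illustration; the numbers are starting points to tune, and in practice you would probably try one change at a time):

from keras.layers import LSTM
from keras.initializers import TruncatedNormal
from keras.constraints import MaxNorm

small_init = TruncatedNormal(mean=0.0, stddev=0.01)  # option 1: smaller initial weights
norm_cap = MaxNorm(max_value=1.0)                    # option 2: cap weight norms near 1

encoder = LSTM(1024,
               activation='selu',                    # option 5: SELU instead of ReLU
               kernel_initializer=small_init,
               recurrent_initializer=small_init,
               kernel_constraint=norm_cap,
               recurrent_constraint=norm_cap,
               input_shape=(n_in, 1))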

Since your training was stable for many epochs, 3 sounds the most promising: it seems that your weight norms eventually get too large and the gradients explode. Generally, I suggest keeping the weight norms around 1 for 'relu'; you can monitor the L2 norms using the function below. I also recommend See RNN for inspecting layer activations & gradients.


import numpy as np

def inspect_weights_l2(model, names='lstm', axis=-1):
    """Print the max and mean L2 norm of each weight tensor in the matching layers."""
    def _get_l2(w, axis=-1):
        # L2 norm along every axis except the chosen one
        axis = axis if axis != -1 else len(w.shape) - 1
        reduction_axes = tuple([ax for ax in range(len(w.shape)) if ax != axis])
        return np.sqrt(np.sum(np.square(w), axis=reduction_axes))

    def _print_layer_l2(layer, idx, axis=-1):
        W = layer.get_weights()
        l2_all = []
        txt = "{} "

        for w in W:
            txt += "{:.4f}, {:.4f} -- "
            l2 = _get_l2(w, axis)
            l2_all.extend([l2.max(), l2.mean()])  # max and mean norm per weight tensor
        txt = txt.rstrip(" -- ")

        print(txt.format(idx, *l2_all))

    names = [names] if isinstance(names, str) else names

    # match layers by (lowercased) name, e.g. 'lstm' matches 'lstm_1', 'lstm_2'
    for idx, layer in enumerate(model.layers):
        if any([name in layer.name.lower() for name in names]):
            _print_layer_l2(layer, idx, axis=axis)
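
For example, calling it between training rounds shows whether the norms creep upward before the loss blows up (a sketch reusing the training loop from the question):

for epo in [50, 100, 1000, 2000]:
    model.fit(sequence, sequence, epochs=epo)
    inspect_weights_l2(model, names='lstm')  # prints max / mean L2 norm per weight tensor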
OverLordGoldDragon
  • I saw a post with François Chollet from 2014 where people were discussing whether the 'relu' activation is responsible for this (which is what you are also suggesting). The proposed fix was to use a 'sigmoid' activation function instead. I wanted to ask if it is known why that is the case (I did not get that info from the other post you highlighted)? – Dr. H. Lecter Mar 06 '20 at 09:07
  • @Dr.H.Lecter I discourage using sigmoid; use `'tanh'` instead. Both avoid the instability that comes from recursively feeding an unbounded activation - i.e. imagine you feed `2` for 100 timesteps; that's 2^100. Relative to `'sigmoid'`, `'tanh'` enjoys superior gradient backpropagation and gating dynamics, partly because it also has negative activations. – OverLordGoldDragon Mar 06 '20 at 09:59
  • 1
    Many thanks for your answer. My mistake they actually were recommending 'thanh' in the post. – Dr. H. Lecter Mar 06 '20 at 10:39