
I have a dataset of monthly user activity, segmented by country and browser. Each row is one day of user activity summed up, plus a score for that daily activity. For example, the number of sessions per day is one feature. The score is a floating-point number calculated from those daily features.

My goal is to try and predict the "average user" score at the end of the month using just 2 days of user data.

I have 25 months of data; some months are full and some contain only part of the total days. In order to have a fixed batch size I've padded the sequences like so:

from keras.preprocessing.sequence import pad_sequences
padded_sequences = pad_sequences(sequences, maxlen=None, dtype='float64', padding='pre', truncating='post', value=-10.)

so sequences shorter than the maximum were padded with rows of -10.
I've decided to create an LSTM model to digest the data, so at the end of each sequence the model should predict the average user score. Later I'll try to predict using just a 2-day sample.

My model looks like this:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout,Dense,Masking
from tensorflow.keras import metrics
from tensorflow.keras.callbacks import TensorBoard
from tensorflow.keras.optimizers import Adam

import datetime, os

model = Sequential()
opt = Adam(learning_rate=0.0001, clipnorm=1)

# train_x has shape (num_segments, padded_sequence_length, num_features)
num_samples = train_x.shape[1]
num_features = train_x.shape[2]

model.add(Masking(mask_value=-10., input_shape=(num_samples, num_features)))
model.add(LSTM(64, return_sequences=True, activation='relu'))
model.add(Dropout(0.3))

#this is the last LSTM layer, use return_sequences=False
model.add(LSTM(64, return_sequences=False, stateful=False,  activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(1))

# note: passing the string 'adam' here means the Adam instance with clipnorm defined above is never used
model.compile(loss='mse', optimizer='adam', metrics=['acc', metrics.mean_squared_error])

logs_base_dir = "logs"  # base directory for the TensorBoard logs, defined here so the snippet runs
logdir = os.path.join(logs_base_dir, datetime.datetime.now().strftime("%Y%m%d-%H%M%S"))
tensorboard_callback = TensorBoard(log_dir=logdir, update_freq=1)
model.summary()

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
masking_5 (Masking)          (None, 4283, 16)          0         
_________________________________________________________________
lstm_20 (LSTM)               (None, 4283, 64)          20736     
_________________________________________________________________
dropout_14 (Dropout)         (None, 4283, 64)          0         
_________________________________________________________________
lstm_21 (LSTM)               (None, 64)                33024     
_________________________________________________________________
dropout_15 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_9 (Dense)              (None, 1)                 65        
=================================================================
Total params: 53,825
Trainable params: 53,825
Non-trainable params: 0
_________________________________________________________________

While training I get NaN values on the 19th epoch:

Epoch 16/1000
16/16 [==============================] - 14s 855ms/sample - loss: 298.8135 - acc: 0.0000e+00 - mean_squared_error: 298.8135 - val_loss: 220.7307 - val_acc: 0.0000e+00 - val_mean_squared_error: 220.7307
Epoch 17/1000
16/16 [==============================] - 14s 846ms/sample - loss: 290.3051 - acc: 0.0000e+00 - mean_squared_error: 290.3051 - val_loss: 205.3393 - val_acc: 0.0000e+00 - val_mean_squared_error: 205.3393
Epoch 18/1000
16/16 [==============================] - 14s 869ms/sample - loss: 272.1889 - acc: 0.0000e+00 - mean_squared_error: 272.1889 - val_loss: nan - val_acc: 0.0000e+00 - val_mean_squared_error: nan
Epoch 19/1000
16/16 [==============================] - 14s 852ms/sample - loss: nan - acc: 0.0000e+00 - mean_squared_error: nan - val_loss: nan - val_acc: 0.0000e+00 - val_mean_squared_error: nan
Epoch 20/1000
16/16 [==============================] - 14s 856ms/sample - loss: nan - acc: 0.0000e+00 - mean_squared_error: nan - val_loss: nan - val_acc: 0.0000e+00 - val_mean_squared_error: nan
Epoch 21/1000

I tried to apply the methods described here with no real success.

Update: I've changed my activation from relu to tanh and it solved the NaN issue. However, it seems that the accuracy of my model stays at 0 while the loss goes down:

Epoch 100/1000
16/16 [==============================] - 14s 869ms/sample - loss: 22.8179 - acc: 0.0000e+00 - mean_squared_error: 22.8179 - val_loss: 11.7422 - val_acc: 0.0000e+00 - val_mean_squared_error: 11.7422

Q: What am I doing wrong here?

Roni Gadot
    I can imagine that this is related to using relu activation in the LSTM layers -- because it's not bounded, this will increase the likelihood of exploding activations/gradients. Have you tried using the default tanh activation? – xdurch0 Apr 23 '20 at 10:11
  • I'll try it and post my feedback – Roni Gadot Apr 23 '20 at 10:23
  • please see my updates – Roni Gadot Apr 23 '20 at 10:56
  • I really doubt that a NaN loss is evidence of an exploding gradient - if it were, the loss would be Inf, not NaN - this mostly resembles a vanishing or dying gradient – JeeyCi Jun 04 '22 at 16:46

1 Answer


You are solving a regression task; using accuracy is not meaningful here.

Use mean_absolute_error to check whether your error is decreasing over time.
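
For example, a minimal sketch of the compile call with MAE tracked instead of accuracy (reusing the model from the question):

# Accuracy has no meaning for a continuous target; track MAE instead.
model.compile(loss='mse', optimizer='adam', metrics=['mae'])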

Instead of blindly predicting the score, you can bound the score to (0, 1).

Just use min-max normalization to bring the output into that range: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

After that you can use a sigmoid activation in the last layer.
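
A minimal sketch of that idea; train_y and val_y are placeholder names for the month-end scores, they are not from your code:

from sklearn.preprocessing import MinMaxScaler

# Scale the target scores into (0, 1) so a sigmoid output can reach them.
y_scaler = MinMaxScaler()
train_y_scaled = y_scaler.fit_transform(train_y.reshape(-1, 1))
val_y_scaled = y_scaler.transform(val_y.reshape(-1, 1))

# The last layer then becomes:
#     model.add(Dense(1, activation='sigmoid'))
# and predictions can be mapped back with y_scaler.inverse_transform(...).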

Also, 4283 is a rather long sequence length for this simple model. How skewed are your sequence lengths?

Maybe plot a histogram of all the sequence lengths and see whether 4283 is, in fact, a good choice. You might be able to bring this down to something like 512, which may be easier for the model.
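
Something like the following, reusing the sequences list you padded earlier (the 95th-percentile cutoff is only an illustration):

import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.sequence import pad_sequences

# Distribution of raw (unpadded) sequence lengths.
lengths = [len(s) for s in sequences]

plt.hist(lengths, bins=50)
plt.xlabel('sequence length (days of activity)')
plt.ylabel('number of segments')
plt.show()

# Cover e.g. 95% of the segments instead of padding everything to the longest one.
maxlen = int(np.percentile(lengths, 95))
padded_sequences = pad_sequences(sequences, maxlen=maxlen, dtype='float64',
                                 padding='pre', truncating='post', value=-10.)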

Also, padding with -10 seems a pretty weird choice: is it something specific to your data, or did you choose it arbitrarily? This -10 also suggests you're not normalizing your input data, which can become a problem for an LSTM with relu; maybe you should try normalizing it before training.
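
One possible sketch, assuming sequences is your list of per-segment 2-D arrays; the choice of StandardScaler is just an example:

import numpy as np
from sklearn.preprocessing import StandardScaler
from keras.preprocessing.sequence import pad_sequences

# Fit the scaler on the stacked, unpadded rows so it never sees the padding value.
all_rows = np.concatenate(sequences, axis=0)   # shape: (total_days, num_features)
x_scaler = StandardScaler().fit(all_rows)

scaled_sequences = [x_scaler.transform(s) for s in sequences]

# Pad afterwards; with standardized features, -10 sits far outside the data range,
# so it is still safe to mask but no huge raw values reach the LSTM.
padded_sequences = pad_sequences(scaled_sequences, maxlen=None, dtype='float64',
                                 padding='pre', truncating='post', value=-10.)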

After these changes, add a validation plot of the mean absolute error if the performance is still not good.
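
For instance, a sketch assuming the model was compiled with metrics=['mae'] and that val_x / val_y_scaled exist (placeholder names):

import matplotlib.pyplot as plt

history = model.fit(train_x, train_y_scaled,
                    validation_data=(val_x, val_y_scaled),
                    epochs=100, batch_size=16)

# Compare training and validation MAE over epochs.
plt.plot(history.history['mae'], label='train MAE')
plt.plot(history.history['val_mae'], label='val MAE')
plt.xlabel('epoch')
plt.legend()
plt.show()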

Zabir Al Nazi
  • Thanks for your inputs, I'll give it a go and post my feedback – Roni Gadot Apr 23 '20 at 11:13
  • Thanks for all your super helpful inputs. 4283 is the max sequence length, meaning this is the segment with the most user traffic, i.e. for a certain day one segment can have 100 visitors and another can have 1000; each batch is the total amount of user sessions over the entire month. I do normalize the values, but I pad with -10 later. The -10 is just a number I chose and I'm masking it in the model; does it matter? – Roni Gadot Apr 23 '20 at 11:55
  • Yes, I understand, but choosing the max sequence length is usually a bad choice: if most of your segments have a length near 1000, then choosing 1000 is a better option. The padding value should be something that doesn't appear commonly in the sequences. – Zabir Al Nazi Apr 23 '20 at 11:57
  • got it, I'll check and select by the average sequence – Roni Gadot Apr 23 '20 at 12:00
  • Thanks for the tips, the model seems to be on the right path. I'll make some predictions and post the results – Roni Gadot Apr 23 '20 at 13:26
  • @ZabirAlNazi Can you please have a look here:https://stackoverflow.com/questions/61443234/python-keras-lstm-data-structure-valueerror – Shlomi Schwartz Apr 26 '20 at 15:23