
I've written the following model to solve a regression problem:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense, Masking
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint
from tensorflow.keras import metrics

def build_model(num_samples, num_features, is_training):
  batch_size = None if is_training else 1  # batch size is 1 when predicting
  is_stateful = not is_training            # model is stateful when predicting
  opt = RMSprop(0.001)

  model = Sequential()
  model.add(Masking(mask_value=-10., input_shape=(num_samples, num_features)))
  model.add(LSTM(32, return_sequences=True, stateful=is_stateful, activation='tanh', batch_input_shape=(batch_size, num_samples, num_features)))
  model.add(Dropout(0.3))
  model.add(LSTM(16, return_sequences=True, stateful=is_stateful, activation='tanh'))
  model.add(Dropout(0.3))
  model.add(Dense(16, activation='tanh'))
  model.add(Dense(8, activation='tanh'))
  model.add(Dense(1))
  if is_training:
    model.compile(loss='mse', optimizer=opt, metrics=['mae', 'mse'])
  return model
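
The training model is built with is_training=True; a minimal sketch of the call that would produce the summary below (2720 timesteps, 16 features, matching the shapes shown):

training_model = build_model(2720, 16, True)  # stateless; batch dimension left as None
training_model.summary()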

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
masking (Masking)            (None, 2720, 16)          0         
_________________________________________________________________
lstm (LSTM)                  (None, 2720, 32)          6272      
_________________________________________________________________
dropout (Dropout)            (None, 2720, 32)          0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 2720, 16)          3136      
_________________________________________________________________
dropout_1 (Dropout)          (None, 2720, 16)          0         
_________________________________________________________________
dense (Dense)                (None, 2720, 16)          272       
_________________________________________________________________
dense_1 (Dense)              (None, 2720, 8)           136       
_________________________________________________________________
dense_2 (Dense)              (None, 2720, 1)           9         
=================================================================
Total params: 9,825
Trainable params: 9,825
Non-trainable params: 0
_________________________________________________________________

The model seems to be converging; after 3000 epochs the MAE is ~3.2:

[training curve screenshot]

When predicting, the model is stateful and the batch size is 1:

Model: "sequential_22"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
masking_4 (Masking)          (1, 1, 16)                0         
_________________________________________________________________
lstm_8 (LSTM)                (1, 1, 32)                6272      
_________________________________________________________________
dropout_8 (Dropout)          (1, 1, 32)                0         
_________________________________________________________________
lstm_9 (LSTM)                (1, 1, 16)                3136      
_________________________________________________________________
dropout_9 (Dropout)          (1, 1, 16)                0         
_________________________________________________________________
dense_12 (Dense)             (1, 1, 16)                272       
_________________________________________________________________
dense_13 (Dense)             (1, 1, 8)                 136       
_________________________________________________________________
dense_14 (Dense)             (1, 1, 1)                 9         
=================================================================
Total params: 9,825
Trainable params: 9,825
Non-trainable params: 0
_________________________________________________________________
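
For reference, the (1, 1, 16) shapes in this summary imply the prediction model was built with a single timestep per call; a minimal sketch that would correspond to it (assuming num_samples=1):

predicting_model = build_model(1, 16, False)  # stateful, batch size 1, one timestep per call
predicting_model.summary()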

The prediction model is then populated with the trained weights and the predicted score is retrieved after each sample like so:

import tensorflow as tf
import numpy as np

best_model = tf.keras.models.load_model('checkpoint.h5')
predicting_model = build_model(2720, 16, False)  # False: create a stateful model with batch size 1
predicting_model.set_weights(best_model.get_weights())

# Printing the desired targets
for index, row in enumerate(validation_y):
  if index % 2720:
    print(index, row[0])

# Printing the result for each sample
for index, batch in enumerate(validation_x):
  for index, sample in enumerate(batch):
    print(predicting_model.predict_on_batch(np.array([[sample]])))
print(index, "-------")
predicting_model.reset_states()
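
Each predict_on_batch call wraps a single timestep into shape (1, 1, 16); a minimal illustration of that reshaping (validation_x has shape (10, 2720, 16)):

import numpy as np

sample = validation_x[0][0]  # one timestep of the first sequence, shape (16,)
x = np.array([[sample]])     # add batch and time dimensions -> shape (1, 1, 16)
print(x.shape)               # (1, 1, 16)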

output:

1 [17.28016644]
2 [13.66593599]
3 [13.30965909]
4 [16.94327097]
5 [10.93074054]
6 [12.86584576]
7 [16.85743802]
8 [24.30536226]
9 [18.39125296]

----- Predictions -----
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
tf.Tensor([[[18.379564]]], shape=(1, 1, 1), dtype=float32)
    ...
    ...
    ...

Q: The predicted results are all the same; what am I doing wrong?


Update: I've tried printing a single sample before predicting, to see what I'm feeding the model; the inputs are different but the result is the same:

for index, batch in enumerate(validation_x):
  for index, sample in enumerate(batch):
    print(np.array([[sample]]))
    print(predicting_model.predict_on_batch(np.array([[sample]])))
    break
  print(index, "-------")
  predicting_model.reset_states()


[[[ 0.00000000e+00  3.42251853e-04  0.00000000e+00  0.00000000e+00
    2.59216149e-03  0.00000000e+00  0.00000000e+00  4.29978079e-03
    7.85496556e-05  0.00000000e+00 -8.93542054e-05 -3.11892174e-04
    0.00000000e+00  0.00000000e+00  2.17638422e-03  3.16997379e-03]]]
[[[18.468756]]]  <--- RESULT
0 -------
[[[ 0.00000000e+00  1.02675556e-03  0.00000000e+00  0.00000000e+00
    5.18432298e-03  3.34065889e-03  0.00000000e+00  2.80437035e-03
    0.00000000e+00  0.00000000e+00 -8.93542054e-05 -3.11892174e-04
    0.00000000e+00  0.00000000e+00  2.17638422e-03  9.84846226e-04]]]
[[[18.468756]]]  <--- RESULT
0 -------
[[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
    5.18432298e-03  1.33626356e-03  0.00000000e+00  2.94094896e-03
    1.57099311e-04  0.00000000e+00 -8.93542054e-05 -3.11892174e-04
    0.00000000e+00  0.00000000e+00  2.17638422e-03  8.92516892e-04]]]
[[[18.468756]]]  <--- RESULT

Update 2: Just to be clear, I'm splitting my data into a training set and a validation set, but during training the training data is used with a validation split of 0.3:

training_model.fit(train_x, train_y, epochs=3000, batch_size=128, validation_split=0.3, callbacks=[tensorboard_callback, checkpoint_callback])
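
The callbacks referenced in fit aren't defined in the snippets above; they could look roughly like this (a sketch: the log directory is an assumption, while checkpoint.h5 matches the path loaded earlier):

from tensorflow.keras.callbacks import TensorBoard, ModelCheckpoint

# Assumed setup: log training metrics and keep the best weights by validation loss.
tensorboard_callback = TensorBoard(log_dir='logs')
checkpoint_callback = ModelCheckpoint('checkpoint.h5', monitor='val_loss', save_best_only=True)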
  • The prediction model would be stateful according to this line: `is_stateful = False if is_training else True`?!! – today Apr 27 '20 at 16:16
  • thanks, that was a typo – Shlomi Schwartz Apr 27 '20 at 16:24
  • Did you train with validation data? Are there -10 in the X you're trying to predict? You're still passing 2720 steps at once to the stateful model. (Why is it stateful, by the way?) – Daniel Möller Apr 27 '20 at 16:35
  • @DanielMöller I've made some changes to the code based on your comment; I did not use the validation data in the training set. Currently, there are -10s, since I've just split the data into train and validation. The model is stateful because in production I will not have all the samples at once but one at a time. It is based on an older question and some info I got from you ;) https://stackoverflow.com/questions/53190253/stateful-lstm-and-stream-predictions/53344603#53344603 – Shlomi Schwartz Apr 27 '20 at 18:29
  • So, you need to use validation data in training to avoid overfitting. It's normal for the outputs to get repeated when you pass a masked value (-10). – Daniel Möller Apr 27 '20 at 19:00
  • Did you change this line? `predicting_model = build_model(2720, 16, False)` – Daniel Möller Apr 27 '20 at 19:02
  • What is the shape of `validation_x`? Are the last two lines indented correctly? – Daniel Möller Apr 27 '20 at 19:04
  • What is the result of `model_training.predict_on_batch(training_x[:1])`? – Daniel Möller Apr 27 '20 at 19:15
  • thanks, 1) I've changed the line. 2) shape is (10, 2720, 16). 3) result is `array([[[18.468756], [18.468756], [18.468756], ..., [18.468756], [18.468756], [18.468756]]], dtype=float32)` – Shlomi Schwartz Apr 27 '20 at 20:19
  • @DanielMöller Why do I need to train with validation data? validation data is an unseen set to be tested later on, no? – Shlomi Schwartz Apr 27 '20 at 20:20
  • So, you just discovered that the problem is happening also with train data, so your model is not training well. – Daniel Möller Apr 27 '20 at 20:46
  • You need validation data during training to make sure your model is not overfitting. (Not "train with" validation data, but "use validation data to check overfitting".) – Daniel Möller Apr 27 '20 at 20:46
  • But I use a validation split of 0.3 over the training set, so basically I have train, test, validation – Shlomi Schwartz Apr 28 '20 at 05:38
  • Please see my update # 2 – Shlomi Schwartz Apr 28 '20 at 06:43

0 Answers