I am using LSTM networks for multivariate, multi-timestep predictions: basically seq2seq prediction, where n_inputs timesteps are fed into the model in order to predict the next n_outputs timesteps of a time series.
My question is how to meaningfully apply Dropout and BatchNormalization, as this appears to be a highly discussed topic for recurrent and therefore LSTM networks. Let's stick to Keras as the framework for the sake of simplicity.
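For concreteness, this is the data layout I have in mind; the window sizes and array names below are just placeholders for this question:

import numpy as np

n_inputs, n_outputs, n_features = 24, 6, 3   # placeholder window sizes
series = np.random.rand(1000, n_features)    # dummy multivariate series
target = series[:, 0]                        # variable to be predicted

# slide a window over the series: each sample maps n_inputs past steps
# of all features to the next n_outputs steps of the target
X, y = [], []
for i in range(len(series) - n_inputs - n_outputs + 1):
    X.append(series[i:i + n_inputs])
    y.append(target[i + n_inputs:i + n_inputs + n_outputs])
X, y = np.array(X), np.array(y)   # X: (samples, n_inputs, n_features), y: (samples, n_outputs)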
Case 1: Vanilla LSTM
from keras.models import Sequential
from keras.layers import LSTM, Dense, BatchNormalization, Activation

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(Dense(int(n_blocks / 2)))      # Dense -> BatchNorm -> Activation block (see Q3)
model.add(BatchNormalization())
model.add(Activation(activation))
model.add(Dense(n_outputs))
- Q1: Is it good practice not to use BatchNormalization directly after LSTM layers?
- Q2: Is it good practice to use Dropout inside LSTM layer?
- Q3: Is the usage of BatchNormalization and Dropout between the Dense layers good practice?
- Q4: If I stack multiple LSTM layers, is it a good idea to use BatchNormalization between them? (See the sketch after this list for what I mean.)
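To make Q4 concrete, this is the kind of stacking I mean, using the same imports and placeholder hyperparameters as Case 1; it is only a sketch of the option I am unsure about, not something I claim is good practice:

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, return_sequences=True,
               input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(BatchNormalization())   # normalization between the stacked LSTM layers (Q4)
model.add(LSTM(n_blocks, activation=activation, dropout=dropout_rate))
model.add(Dense(n_outputs))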
Case 2: Encoder-Decoder-like LSTM with TimeDistributed layers
from keras.models import Sequential
from keras.layers import (LSTM, Dense, BatchNormalization, Activation,
                          Dropout, RepeatVector, TimeDistributed)

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))  # encoder
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True, dropout=dropout_rate))  # decoder
model.add(TimeDistributed(Dense(int(n_blocks / 2), use_bias=False)))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Activation(activation)))
model.add(TimeDistributed(Dropout(dropout_rate)))
model.add(TimeDistributed(Dense(1)))
- Q5: Should BatchNormalization and Dropout be wrapped inside TimeDistributed layers when used between TimeDistributed(Dense()) layers, or is it correct to leave them unwrapped?
- Q6: Can or should BatchNormalization be applied before, after, or in between the encoder and decoder LSTM blocks? (See the sketch below for what I mean.)
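For Q6, this is what "between the encoder and decoder blocks" would look like, using the same imports and placeholder hyperparameters as Case 2; again only a sketch of the variant I am asking about:

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features), dropout=dropout_rate))
model.add(BatchNormalization())   # between encoder output and RepeatVector (Q6)
model.add(RepeatVector(n_outputs))
model.add(LSTM(n_blocks, activation=activation, return_sequences=True, dropout=dropout_rate))
model.add(TimeDistributed(Dense(1)))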
- Q7: If a ConvLSTM2D layer is used as the first layer (encoder), would this make a difference in how Dropout and BatchNormalization are used?
- Q8: Should the recurrent_dropout argument be used inside the LSTM blocks? If yes, should it be combined with the normal dropout argument as in the example, or should it replace it? (A sketch of the combination I mean follows below.)
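For Q8, this is the combination of dropout and recurrent_dropout I mean, with the same imports and placeholder hyperparameters as above; only a sketch, not a recommendation:

model = Sequential()
model.add(LSTM(n_blocks, activation=activation, input_shape=(n_inputs, n_features),
               dropout=dropout_rate,             # dropout on the layer inputs
               recurrent_dropout=dropout_rate))  # dropout on the recurrent state transitions
model.add(Dense(n_outputs))

Thank you very much in advance!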