
Suppose that I have a model like this (this is a model for time series forecasting):

from keras.layers import Input, Conv1D, LSTM, BatchNormalization, Dense

ipt = Input((data.shape[1], data.shape[2]))                                         # (timesteps, features)
x   = Conv1D(filters=10, kernel_size=3, padding='causal', activation='relu')(ipt)  # causal convolution
x   = LSTM(15, return_sequences=False)(x)                                          # recurrent layer
x   = BatchNormalization()(x)                                                      # proposed placement
out = Dense(1, activation='relu')(x)                                               # forecast output

Now I want to add a batch normalization layer to this network. Considering that batch normalization doesn't work with LSTM, can I add it before the Conv1D layer? I think it's rational to have a batch normalization layer after the LSTM.

Also, where can I add Dropout in this network? In the same places (before or after batch normalization)?

  • What about adding AveragePooling1D between Conv1D and LSTM? In that case, is it possible to add batch normalization between Conv1D and AveragePooling1D without affecting the LSTM layer?
  • BN _can_ be used with LSTMs - your linked SO's top answer gives a false verdict. Avoid Dropout between LSTMs - `recurrent_dropout` should work better. – OverLordGoldDragon Dec 11 '19 at 19:10
  • @OverLordGoldDragon So you're saying I can add `BatchNormalization` layer before LSTM in my case? Could you please add an answer with more details? – Eghbal Dec 11 '19 at 20:08
  • Depends on the kind of 'answer' you seek; I cannot "explain" it at this time, as I plan to make a separate Q&A dedicated to explaining BatchNorm entirely (existing material doesn't do the topic justice) - but in my application of EEG classification, BatchNorm dominated LayerNorm for exactly a CNN-LSTM architecture. If satisfactory, I can just state some good practices worth trying – OverLordGoldDragon Dec 11 '19 at 20:16
  • @OverLordGoldDragon It's good. You can add your experience with `CNN-LSTM`. – Eghbal Dec 11 '19 at 20:26
  • Alright; I'll request a bit more info ahead: what type of data are you dealing with (stocks, signals, etc.), and what are the sequence lengths? What dimension is the data (# of input channels / variables)? – OverLordGoldDragon Dec 11 '19 at 20:29
  • @OverLordGoldDragon It's a forecasting case (stocks). The number of training samples is near 2000 (I also have another case with 1000 samples). The sequence length is fixed and equal to 21. The number of variables is flexible (somewhere between 7 and roughly 50). – Eghbal Dec 11 '19 at 20:34

1 Answer


Update: the LayerNormalization implementation I was using was inter-layer, not recurrent as in the original paper; results with the latter may prove superior.


BatchNormalization can work with LSTMs - the linked SO gives false advice; in fact, in my application of EEG classification, it dominated LayerNormalization. Now to your case:

  • "Can I add it before Conv1D"? Don't - instead, standardize your data beforehand, else you're employing an inferior variant to do the same thing
  • Try both: BatchNormalization before an activation, and after - apply to both Conv1D and LSTM
  • If your model is exactly as you show it, BN after the LSTM may be counterproductive, as it can introduce noise that confuses the classifier layer - but this is about being one layer before the output, not about the LSTM itself
  • If you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before LSTM, after, or both
  • Spatial Dropout: drop units / channels instead of random activations (see bottom); it was shown more effective at reducing co-adaptation in CNNs in a paper by LeCun et al., with ideas applicable to RNNs. It can considerably increase convergence time, but also improve performance
  • recurrent_dropout is still preferable to Dropout for LSTM - however, you can do both; just do not use it with activation='relu', for which LSTM is unstable per a bug
  • For data of your dimensionality, any sort of Pooling is redundant and may harm performance; scarce data is better transformed via a non-linearity than simple averaging ops
  • I strongly recommend a SqueezeExcite block after your Conv; it's a form of self-attention - see paper; my implementation for 1D below
  • I also recommend trying activation='selu' with AlphaDropout and 'lecun_normal' initialization, per the paper Self-Normalizing Neural Networks
  • Disclaimer: above advice may not apply to NLP and embed-like tasks
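
As a rough sketch of the standardization mentioned in the first bullet - assuming data shaped (samples, timesteps, channels), with illustrative shapes and a hypothetical `standardize` helper; statistics are taken from the training split only to avoid leakage:

import numpy as np

def standardize(train, test, eps=1e-7):
    # per-channel mean/std from the training split only, applied to both splits
    mean = train.mean(axis=(0, 1), keepdims=True)
    std  = train.std(axis=(0, 1), keepdims=True) + eps
    return (train - mean) / std, (test - mean) / std

x_train = np.random.randn(2000, 21, 20)  # toy data: (samples, timesteps, channels)
x_test  = np.random.randn(500,  21, 20)
x_train, x_test = standardize(x_train, x_test)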

Below is an example template you can use as a starting point; I also recommend the following SO answers for further reading: Regularizing RNNs, and Visualizing RNN gradients

from keras.layers import Input, Dense, LSTM, Conv1D, Activation
from keras.layers import AlphaDropout, BatchNormalization
from keras.layers import GlobalAveragePooling1D, Reshape, multiply
from keras.models import Model
import keras.backend as K
import numpy as np


def make_model(batch_shape):
    ipt = Input(batch_shape=batch_shape)
    x   = ConvBlock(ipt)
    x   = LSTM(16, return_sequences=False, recurrent_dropout=0.2)(x)
    # x   = BatchNormalization()(x)  # may or may not work well
    out = Dense(1, activation='relu')(x)

    model = Model(ipt, out)
    model.compile('nadam', 'mse')
    return model

def make_data(batch_shape):  # toy data
    return (np.random.randn(*batch_shape),
            np.random.uniform(0, 2, (batch_shape[0], 1)))

batch_shape = (32, 21, 20)
model = make_model(batch_shape)
x, y  = make_data(batch_shape)

model.train_on_batch(x, y)

Functions used:

def ConvBlock(_input):  # cleaner code
    x   = Conv1D(filters=10, kernel_size=3, padding='causal', use_bias=False,
                 kernel_initializer='lecun_normal')(_input)
    x   = BatchNormalization(scale=False)(x)
    x   = Activation('selu')(x)
    x   = AlphaDropout(0.1)(x)
    out = SqueezeExcite(x)    
    return out

def SqueezeExcite(_input, r=4):  # r == "reduction factor"; see paper
    filters = K.int_shape(_input)[-1]

    se = GlobalAveragePooling1D()(_input)
    se = Reshape((1, filters))(se)
    se = Dense(filters//r, activation='relu',    use_bias=False,
               kernel_initializer='he_normal')(se)
    se = Dense(filters,    activation='sigmoid', use_bias=False, 
               kernel_initializer='he_normal')(se)
    return multiply([_input, se])

Spatial Dropout: pass noise_shape = (batch_size, 1, channels) to Dropout - this drops entire channels rather than individual activations; see Git gist for code.
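
A minimal sketch of the above (not the gist itself; the shapes simply mirror the toy template, and keras' built-in SpatialDropout1D is noted as an equivalent):

from keras.layers import Input, Conv1D, Dropout, SpatialDropout1D
from keras.models import Model

batch_shape = (32, 21, 20)
ipt = Input(batch_shape=batch_shape)
x   = Conv1D(filters=10, kernel_size=3, padding='causal')(ipt)

# channel-wise (spatial) dropout: zeroes the same channels across all timesteps,
# instead of dropping random individual activations
x   = Dropout(0.1, noise_shape=(batch_shape[0], 1, 10))(x)
# equivalent built-in layer, which infers the noise shape automatically:
# x = SpatialDropout1D(0.1)(x)

model = Model(ipt, x)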

  • Thank you for this detailed answer. I think there is also a doubt about `Shuffle` in `fit` for time series forecasting using sequential models in TensorFlow. Some people say we should keep the default value (`True`), but others insist on changing it. I think it's just related to the order of batches for optimization, so if we set it to `False`, we start optimizing the model from the beginning of the data to the end. How can we support this idea in the context of time series forecasting? – Eghbal Dec 12 '19 at 00:13
  • @user2991243 Roll with `True`; the only shuffling of concern is _samples within batches_ for `stateful=True` layers - see "When and how does LSTM pass states in stateful?" in [this answer](https://stackoverflow.com/questions/58276337/proper-way-to-feed-time-series-data-to-stateful-lstm/58277760#58277760) – OverLordGoldDragon Dec 12 '19 at 00:18
  • What is its effect if we change it to `False`? – Eghbal Dec 12 '19 at 00:20
  • @user2991243 General learning principles, nothing LSTM-specific; helps prevent model adaptation to the order of batches in the dataset. ... at least that's what most will tell you; the full answer is more complex and a subject of its own topic, but in a nutshell: "loss surface diversification". – OverLordGoldDragon Dec 12 '19 at 00:23
  • @user2991243 See "How does 'learning' work?" in [this answer](https://stackoverflow.com/questions/48714407/rnn-regularization-which-component-to-regularize/58868383#58868383) – OverLordGoldDragon Dec 12 '19 at 00:25
  • You mentioned that 'if you aren't using stacked LSTM with return_sequences=True preceding return_sequences=False, you can place Dropout anywhere - before LSTM, after, or both'. If we use stacked LSTM, we can't use `Dropout` between LSTM layers, but it's possible to use it before and after these two layers. – Eghbal Dec 12 '19 at 18:53
  • @user2991243 Between LSTMs, the rate should be limited to `0.2` tops, but if I use it I go with `0.1`; also consider Spatial Dropout - see updated answer. As for right before LSTM, that's a bit questionable; I opt for a "warmup" route, where I start with 0.2 pre-LSTM dropout and max out at 0.5 late-stage - but after LSTM for `return_sequences=False`, any usual dropout rate should be fine. Also, as a disclaimer, take most of my tips on specific rates with a grain of salt, as I work with very long sequences, whereas you have fewer timesteps to spare - your best bet is to experiment and see what happens – OverLordGoldDragon Dec 12 '19 at 19:21
  • Do we still have the mentioned issue (LSTM being unstable per a bug when we use `recurrent_dropout`) in TensorFlow 2 and the latest version of Keras? – Eghbal Dec 13 '19 at 13:39
  • Also, about the `Spatial Dropout` you recently mentioned: I think we can only apply it to the `Dropout` before the `LSTM` and after the `CNN`, not the `Dropout` after the `LSTM`. Is this correct? – Eghbal Dec 13 '19 at 14:38
  • @user2991243 Yes, see minimal example [here](https://stackoverflow.com/questions/57516678/lstm-recurrent-dropout-with-relu-yields-nans) - also the case for `tensorflow.keras`. And I can't speak from experience as I never tried it, but per linked LeCun et al paper, Spatial Dropout on LSTMs should work even better than regular dropout, as they _do not corrupt timesteps_ - instead they drop entire channels, which is a form of "cleaner noise". – OverLordGoldDragon Dec 13 '19 at 18:11