34

I know that in regular neural nets people use batch norm before the activation, and that it reduces the reliance on good weight initialization. I wonder if it would do the same for an RNN/LSTM RNN if I used it. Does anyone have any experience with it?

desertnaut
Peter Deng

5 Answers

30

No, you cannot use Batch Normalization on a recurrent neural network: the statistics are computed per batch, which does not take the recurrent part of the network into account. Weights are shared in an RNN, and the activation response for each "recurrent loop" might have completely different statistical properties.

Other techniques similar to Batch Normalization that take these limitations into account have been developed, for example Layer Normalization. There are also reparametrizations of the LSTM layer that allow Batch Normalization to be used, for example as described in Recurrent Batch Normalization by Cooijmans et al. 2016.
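
For illustration, here is a minimal Keras sketch of the Layer Normalization alternative (the layer sizes and the placement after the LSTM output are illustrative assumptions, not something prescribed by this answer). Layer Normalization computes its statistics per sample over the feature axis, so it does not depend on the batch or on the position in the sequence:

```python
import tensorflow as tf

# Sketch: Layer Normalization normalizes each timestep's feature vector per sample,
# so no batch statistics are needed and the recurrence is left untouched.
inputs = tf.keras.Input(shape=(None, 64))                      # (batch, time, features)
h = tf.keras.layers.LSTM(128, return_sequences=True)(inputs)   # ordinary LSTM
h = tf.keras.layers.LayerNormalization()(h)                    # per-sample, per-timestep normalization
model = tf.keras.Model(inputs, h)
```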

Dr. Snoopy

  • This answer is not correct. You **can** use batch normalization in recurrent networks: https://arxiv.org/abs/1603.09025 In fact, many DL frameworks have it implemented in corresponding classes. – minerals Jun 03 '18 at 18:46
  • @minerals The paper you linked literally says that you have to reparametrize the LSTM to make Batch Normalization usable with it, so my answer stands: you cannot use it with vanilla recurrent networks; they need modifications or a different form of BN. I will add this reference to the question. – Dr. Snoopy Jun 12 '19 at 13:28

17

Batch normalization applied to RNNs is similar to batch normalization applied to CNNs: you compute the statistics in such a way that the recurrent/convolutional properties of the layer still hold after BN is applied.

For CNNs, this means computing the relevant statistics not just over the mini-batch, but also over the two spatial dimensions; in other words, the normalization is applied over the channels dimension.

For RNNs, this means computing the relevant statistics over the mini-batch and the time/step dimension, so the normalization is applied only over the feature (depth) dimension. This also means that you only batch normalize the transformed input (i.e., in the vertical direction, e.g. BN(W_x * x)), since the horizontal (across-time) connections are time-dependent and shouldn't just be plainly averaged.
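
A minimal sketch of that recipe, assuming a tf.keras setup (the layer sizes are made up): batch-normalize only the transformed input, with statistics taken over the batch and time dimensions, and leave the recurrent connections alone.

```python
import tensorflow as tf

# Sketch: BN(W_x * x) in the vertical direction only; statistics are computed
# over the batch and time dimensions, i.e. per feature channel.
inputs = tf.keras.Input(shape=(None, 64))                # (batch, time, features)
x = tf.keras.layers.Dense(128, use_bias=False)(inputs)   # W_x * x, applied per timestep
x = tf.keras.layers.BatchNormalization(axis=-1)(x)       # normalize over (batch, time) per channel
outputs = tf.keras.layers.LSTM(128)(x)                   # horizontal (across-time) path untouched
model = tf.keras.Model(inputs, outputs)
```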

The Guy with The Hat
velocirabbit

  • This is a great explanation. But can normalization be applied between the LSTM layers if the model includes more than 1 LSTM layer, where the output of layer i becomes the input to layer (i+1)? In this case, the output of layer i will be normalized... – edn Sep 09 '18 at 02:09
  • Just a reminder for anyone coming across this comment: be careful with variable sequence lengths. If there is padding in your batch, make sure it isn't included in your batch norm. – Benjamin Striner Jul 03 '19 at 20:19
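
To make the padding caveat concrete, here is a small NumPy sketch (the function name, shapes, and mask convention are assumptions, not from the thread) of computing BN statistics while excluding padded timesteps:

```python
import numpy as np

# Hypothetical helper: per-feature BN statistics over (batch, time),
# counting only unpadded positions.
# x: (batch, time, features); mask: (batch, time), 1.0 for real tokens, 0.0 for padding.
def masked_bn_stats(x, mask, eps=1e-5):
    m = mask[..., None]                               # broadcast the mask over features
    n = m.sum()                                       # number of unpadded positions
    mean = (x * m).sum(axis=(0, 1)) / n
    var = (((x - mean) ** 2) * m).sum(axis=(0, 1)) / n
    return mean, np.sqrt(var + eps)
```
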
8

In any non-recurrent network (convnet or not), when you do BN, each layer gets to adjust the incoming scale and mean, so the incoming distribution for each layer doesn't keep changing (which is what the authors of the BN paper claim is the advantage of BN).

The problem with doing this for the recurrent outputs of an RNN is that the parameters for the incoming distribution are now shared between all timesteps (which are effectively layers in backpropagation through time, or BPTT). So the distribution ends up being fixed across the temporal layers of BPTT. This may not make sense, as there may be structure in the data that varies (in a non-random way) across the time series. For example, if the time series is a sentence, certain words are much more likely to appear at the beginning or the end. So having the distribution fixed might reduce the effectiveness of BN.
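
As a toy illustration of this point (the data below is made up): if early and late timesteps have different structure, their per-timestep statistics differ from the single pooled statistics that a shared normalization would impose.

```python
import numpy as np

# Fabricated activations where the first and last timesteps are shifted,
# mimicking e.g. sentence-start and sentence-end tokens.
rng = np.random.default_rng(0)
T, B, D = 20, 32, 8                          # timesteps, batch size, features
acts = rng.normal(size=(T, B, D))
acts[0] += 3.0                               # early-timestep shift
acts[-1] -= 2.0                              # late-timestep shift

per_step_mean = acts.mean(axis=(1, 2))       # one mean per timestep
pooled_mean = acts.mean()                    # what a single shared distribution would use
print(per_step_mean[0], per_step_mean[-1], pooled_mean)  # clearly different values
```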

DankMasterDan

3

It is not commonly used, though I found that this paper from 2017 shows that using batch normalization in the input-to-hidden and the hidden-to-hidden transformations trains faster and generalizes better on some problems.
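
For intuition, here is a rough NumPy sketch of that general idea (my own simplification with assumed names and shapes, not the exact formulation from the paper): the input-to-hidden and hidden-to-hidden transforms are batch-normalized separately before being combined in the gates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_norm(z, gamma, eps=1e-5):
    mean = z.mean(axis=0, keepdims=True)     # statistics over the batch dimension
    var = z.var(axis=0, keepdims=True)
    return gamma * (z - mean) / np.sqrt(var + eps)

def bn_lstm_step(x_t, h_prev, c_prev, Wx, Wh, b, gamma_x, gamma_h):
    # Wx: (input_dim, 4*hidden), Wh: (hidden, 4*hidden), b: (4*hidden,)
    z = batch_norm(x_t @ Wx, gamma_x) + batch_norm(h_prev @ Wh, gamma_h) + b
    i, f, o, g = np.split(z, 4, axis=1)      # input, forget, output gates and candidate
    c_t = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h_t = sigmoid(o) * np.tanh(c_t)
    return h_t, c_t
```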

desertnaut
David Taub
3

The answer is Yes and No.

Why yes? According to the Layer Normalization paper, which clearly indicates the usage of BN in RNNs.

Why no? The output distribution at each timestep has to be calculated and stored to conduct BN. Imagine that you pad the sequence inputs so that all examples have the same length; then, if a prediction case is longer than all training cases, at some timestep you will have no mean/std of the output distribution summarized from the SGD training procedure.

Meanwhile, at least in Keras, I believe the BN layer only considers normalization in the vertical direction, i.e., the sequence output. The horizontal direction, i.e., the hidden state and cell state, is not normalized. Correct me if I am wrong here.

In multiple-layer RNNs, you may consider using layer normalization tricks.

Bs He

  • Why is this not an issue for LayerNorm in RNNs? I see the argument that you need to maintain all the statistics for every time step. Doesn't LayerNorm need to maintain the statistics for each time step as well? – Touma Jun 12 '20 at 17:45
  • @Touma No, LN does not need to store time-dependent statistics. It is done layer-wise, i.e., for each sample the normalization is done individually. It does not rely on other samples, and therefore it does not store statistics for each time step. – Bs He Jun 13 '20 at 18:13