
In Keras you can specify a dropout layer like this:

model.add(Dropout(0.5))

But with a GRU cell you can specify the dropout as a parameter in the constructor:

model.add(GRU(units=512,
              return_sequences=True,
              dropout=0.5,
              input_shape=(None, features_size)))

What's the difference? Is one preferable to the other?

The Keras documentation adds it as a separate Dropout layer (see "Sequence classification with LSTM").

BigBadMe

1 Answer


Recurrent layers perform the same operation over and over, once per timestep.

At each timestep, the layer takes two inputs:

  • Your inputs (a step of your sequence)
  • Internal inputs (for instance, the states and the output of the previous step)

Note that the input and output dimensions may not match, which means that the dimensions of "your inputs" will not match the dimensions of "the recurrent inputs (previous step/states)".

Then, at every recurrent timestep, there are two operations with two different kernels:

  • One kernel is applied to "your inputs" to process and transform them into a compatible dimension.
  • Another (called the recurrent kernel by Keras) is applied to the inputs from the previous step.

Because of this, Keras also uses two dropout operations in its recurrent layers (dropouts that are applied at every step):

  • A dropout for the first transformation of your inputs
  • A dropout for the application of the recurrent kernel

So, in fact there are two dropout parameters in RNN layers:

  • dropout, applied to the first operation on the inputs
  • recurrent_dropout, applied to the other operation on the recurrent inputs (previous output and/or states)

You can see this coded in GRUCell and LSTMCell, for instance, in the source code.
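
As a rough sketch of where the two masks act (a simplified SimpleRNN-style step with made-up names, purely for illustration, not the actual GRU/LSTM code):

import numpy as np

# One simplified recurrent step: the two dropout masks hit the two
# different inputs before their respective kernels are applied.
def rnn_step(x_t, h_prev, kernel, recurrent_kernel, bias,
             dropout_mask, recurrent_dropout_mask):
    x_t = x_t * dropout_mask                  # "dropout": mask on this step's input
    h_prev = h_prev * recurrent_dropout_mask  # "recurrent_dropout": mask on the previous state
    return np.tanh(x_t @ kernel + h_prev @ recurrent_kernel + bias)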


What is correct?

This is open to creativity.

You can use a Dropout(...) layer; it's not "wrong", but it will possibly drop "timesteps" too! (Unless you set noise_shape properly or use SpatialDropout1D, which is not documented yet.)

Maybe you want that, maybe you don't. If you use the parameters in the recurrent layer, you will be applying dropout only to the other dimensions, without dropping a single step. This seems healthy for recurrent layers, unless you want your network to learn how to deal with sequences containing gaps (that last sentence is a supposition).
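
As a sketch of the two options side by side (assuming the usual standalone Keras imports and an illustrative features_size; swap in tensorflow.keras equivalents if needed):

from keras.models import Sequential
from keras.layers import GRU, Dropout, SpatialDropout1D

features_size = 64  # placeholder value, not from the question

# Option 1: a separate layer after the GRU. A plain Dropout(0.5) drops
# individual entries (so particular timesteps of particular features);
# SpatialDropout1D(0.5) drops whole feature channels across all timesteps.
model_a = Sequential()
model_a.add(GRU(512, return_sequences=True, input_shape=(None, features_size)))
model_a.add(SpatialDropout1D(0.5))  # or: model_a.add(Dropout(0.5))

# Option 2: dropout built into the recurrent layer, applied at every step.
model_b = Sequential()
model_b.add(GRU(512, return_sequences=True,
                dropout=0.5,            # mask on the layer's inputs
                recurrent_dropout=0.5,  # mask on the recurrent inputs (previous states)
                input_shape=(None, features_size)))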

Also, with the dropout parameters you are really dropping parts of the kernel's operations, since they are dropped "at every step", while using a separate layer lets your RNN perform its internal operations without dropout, since the dropout then affects only the final output.

Daniel Möller
  • Thanks Daniel, your explanation of what the dropout is doing ties in with the answer given here: https://stackoverflow.com/questions/44924690/ And yes, obviously setting it on the cell constructor means I can also specify the recurrent dropout, so I think I will set it as part of the constructor. One additional question: I assume that when I come to doing .evaluate and .predict, Keras will automatically set the dropout probability to zero? I previously coded in TensorFlow and needed to specify that myself, but I assume Keras does that for me...? – BigBadMe Jun 06 '18 at 13:35
  • 1
    Yes, keras uses somewhere in its codes a condition `K.in_training_phase(expression_for_training,express_for_non_training)`. And dropouts are only applied in training phase. – Daniel Möller Jun 06 '18 at 13:54
  • See [L2212](https://github.com/keras-team/keras/blob/master/keras/layers/recurrent.py#L2212) – Daniel Möller Jun 06 '18 at 13:55
  • Brilliant, thanks very much for your help Daniel, much appreciated. – BigBadMe Jun 06 '18 at 18:35
  • If you used a Dropout() after an RNN/LSTM/GRU with `return_sequences=True` and before a new RNN/LSTM/etc. layer, would that be the same as setting the dropout of the next layer? – Maverick Meerkat Nov 18 '19 at 19:01
  • No, dropouts of RNN layers have special behavior for RNN layers. – Daniel Möller Nov 18 '19 at 19:19
  • @DanielMöller - sorry I am a complete beginner. As what I understand from above, you recommend using a dropout within the lstm and removing the dropouts outside of it as commented out below? model = Sequential() model.add(LSTM(256, input_shape=(trainX.shape[1], trainX.shape[2]), Dropout=0.1, return_sequences=True)) model.add(LSTM(128, Dropout=0.1, return_sequences=True)) model.add(LSTM(32, return_sequences=True)) # model.add(Dropout(0.1)) model.add(Dense(14)) model.add(Dense(trainY.shape[1])) model.compile(optimizer='adam', loss='mean_squared_error') – SamV Jul 19 '21 at 10:10
  • @SamirVinchurkar, there is no absolute answer for neural networks. This is a choice you have to do experimentally and see what gives you the best results. – Daniel Möller Jul 19 '21 at 12:57
  • @DanielMöller Thanks! --I understand, and I am only seeking a recommendation based on your experience. I was using ReLu and had tried removing 0s, clipping, lr, changing trainx sizes etc - tanh really did the trick for me. Let me be a bit specific about the question: do you recommend using recurrent dropouts within lstm since using it outside would result in training data loss (hence possibly lower performance)? Or am I off track here? – SamV Jul 19 '21 at 14:37
  • @SamirVinchurkar, again, there is no correct answer. Test it with your data and your model. – Daniel Möller Jul 19 '21 at 15:45