116

I am trying to grasp what the TimeDistributed wrapper does in Keras.

I get that TimeDistributed "applies a layer to every temporal slice of an input."

But I did some experiments and got results that I cannot understand.

In short, when connected to an LSTM layer, TimeDistributed and a plain Dense layer produce the same results.

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

model = Sequential()
model.add(LSTM(5, input_shape = (10, 20), return_sequences = True))
model.add(TimeDistributed(Dense(1)))
print(model.output_shape)

model = Sequential()
model.add(LSTM(5, input_shape = (10, 20), return_sequences = True))
model.add((Dense(1)))
print(model.output_shape)

For both models, I got an output shape of (None, 10, 1).

Can anyone explain the difference between TimeDistributed and Dense layer after an RNN layer?

U13-Forward
  • 69,221
  • 14
  • 89
  • 114
Buomsoo Kim
  • 1,283
  • 2
  • 9
  • 5
  • 2
    There currently seems to be no difference; [here](https://github.com/fchollet/keras/issues/278) is a discussion about it. I think the original intent was to make a distinction between the `Dense` layer flattening the input and then reshaping, hence connecting different time steps and having more parameters, and `TimeDistributed` keeping the time steps separated (hence having fewer parameters). In your case `Dense` should have had 500 parameters, `TimeDistributed` only 50 – gionni Nov 15 '17 at 13:00
  • @gionni Nope, it has the same number of parameters (both 6). So there is virtually no difference atm? – Buomsoo Kim Nov 16 '17 at 02:09
  • Yeah exactly, those are the number of parameters they would have if there was a difference. At the moment there isn't – gionni Nov 16 '17 at 12:17
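
For reference, the figure of 6 parameters mentioned in the comments can be checked directly; this is a minimal sketch, and the 6 comes from the Dense(1) kernel of shape (5, 1) plus its bias:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

m1 = Sequential()
m1.add(LSTM(5, input_shape=(10, 20), return_sequences=True))
m1.add(TimeDistributed(Dense(1)))

m2 = Sequential()
m2.add(LSTM(5, input_shape=(10, 20), return_sequences=True))
m2.add(Dense(1))

# The final layer of each model has 5 * 1 + 1 = 6 trainable parameters.
print(m1.layers[-1].count_params(), m2.layers[-1].count_params())  # 6 6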

2 Answers

107

In Keras, while building a sequential model, the second dimension (the one right after the sample dimension) is usually the time dimension. This means that if, for example, your data is 5-D with shape (sample, time, width, length, channel), you could apply a convolutional layer using TimeDistributed (which is applicable to 4-D data with shape (sample, width, length, channel)) along the time dimension (applying the same layer to each time slice) in order to obtain a 5-D output.
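
A minimal sketch of that idea (the input shape and the Conv2D configuration below are arbitrary, just for illustration):

import tensorflow as tf

# 5-D input: (sample, time, width, length, channel)
inputs = tf.keras.Input(shape=(10, 32, 32, 3))

# The wrapped Conv2D only sees 4-D slices (sample, width, length, channel);
# the same kernel is applied to each of the 10 time slices.
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv2D(16, kernel_size=3, padding='same'))(inputs)

print(x.shape)  # (None, 10, 32, 32, 16) -- still 5-D, time dimension preserved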

The case with Dense is that from Keras 2.0 onwards, Dense is by default applied only to the last dimension (e.g. if you apply Dense(10) to an input with shape (n, m, o, p) you'll get an output with shape (n, m, o, 10)), so in your case Dense and TimeDistributed(Dense) are equivalent.
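
A quick way to see this behaviour (the shapes here are arbitrary):

import tensorflow as tf

# Dense only touches the last axis; all leading dimensions pass through unchanged.
x = tf.keras.Input(shape=(4, 6, 8))          # (n, m, o, p) with p = 8
y = tf.keras.layers.Dense(10)(x)

print(y.shape)                                # (None, 4, 6, 10)
print(tf.keras.Model(x, y).count_params())    # 8 * 10 + 10 = 90, independent of m and o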

Marcin Możejko
  • 39,542
  • 10
  • 109
  • 120
  • 4
    There's an example of using TimeDistributed wrapping the model itself. When this is applied to an `Input` tensor, is there any difference from this compared to just doing a `map` of the model applied to a list that contains each slice of the `Input`? – CMCDragonkai May 18 '18 at 05:07
1
import numpy as np
import tensorflow as tf
from tensorflow import keras

B = 2 # number of batches
d_model = 8 # embedding dimension
T = 3 # number of timesteps

dense_layer = tf.keras.layers.Dense(16)
inp = np.random.randn(B, T, d_model)

# using TimeDistributed layer
inputs = tf.keras.Input(shape=(T, d_model)) # (B, T, d_model)
outputs = tf.keras.layers.TimeDistributed(dense_layer)(inputs) # (B, T, 16)
model1 = keras.Model(inputs, outputs)

otpt1 = model1(inp)

The TimeDistributed layer applies the layer wrapped inside it to each timestep, so the input shape seen by the wrapped dense_layer is (B, d_model). After applying the dense_layer, whose weights have shape (d_model, 16), the output for each timestep is (B, 16); doing this for all timesteps gives an output of shape (B, T, 16).
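
A hand-rolled version of that per-timestep application (a sketch reusing the tensors defined above) makes this explicit:

# Apply the same dense_layer to each (B, d_model) timestep slice and re-stack along T.
manual = tf.stack([dense_layer(inp[:, t, :]) for t in range(T)], axis=1)  # (B, T, 16)
print(np.allclose(manual.numpy(), otpt1.numpy()))  # True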

# Without using TimeDistributed layer
inputs = tf.keras.Input(shape=(T, d_model)) # (B, T, d_model)
outputs = dense_layer(inputs) # (B, T, 16)
model2 = keras.Model(inputs, outputs)

otpt2 = model2(inp)

Without TimeDistributed, the input shape seen by the dense_layer is (B, T, d_model). The weight matrix still has shape (d_model, 16) and is applied to the last axis for every batch and timestep, giving an output of shape (B, T, 16).
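
The same thing written as a plain matrix multiplication over the last axis (a sketch reusing dense_layer and inp from above):

# The Dense kernel W has shape (d_model, 16) and the bias b has shape (16,);
# broadcasting the matmul over the leading (B, T) axes reproduces model2's output.
W, b = dense_layer.get_weights()
print(np.allclose(otpt2.numpy(), inp @ W + b, atol=1e-5))  # True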

np.all(otpt1.numpy() == otpt2.numpy()) # True
  • Dense without TimeDistributed applies its kernel to the last axis of the whole (B, T, d_model) tensor in a single call
  • TimeDistributed(Dense) applies the same Dense layer to each (B, d_model) timestep slice separately, giving an identical result
VVY
  • 57
  • 8