
I'm building a model that converts a string to another string using recurrent layers (GRUs). I have tried both a Dense and a TimeDistributed(Dense) layer as the last-but-one layer, but I don't understand the difference between the two when using return_sequences=True, especially as they seem to have the same number of parameters.

My simplified model is the following:

import keras

InputSize = 15
MaxLen = 64
HiddenSize = 16

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.TimeDistributed(keras.layers.Dense(InputSize))(x)
predictions = keras.layers.Activation('softmax')(x)

The summary of the network is:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

This makes sense to me as my understanding of TimeDistributed is that it applies the same layer at all timepoints, and so the Dense layer has 16*15+15=255 parameters (weights+biases).
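
As a quick sanity check on the arithmetic, the counts in the summary can be reproduced by hand. This is a minimal sketch: the helper names `dense_params` and `gru_params` are hypothetical, and the GRU formula assumes the classic variant with one bias vector per gate, which is what matches the 1536 shown above.

```python
def dense_params(input_dim, units):
    """Weights plus biases of a Dense layer; only the last input dimension matters."""
    return input_dim * units + units


def gru_params(input_dim, units):
    """Three gates, each with an input kernel, a recurrent kernel and one bias vector."""
    return 3 * (input_dim * units + units * units + units)


InputSize = 15
MaxLen = 64  # deliberately unused: the sequence length never enters the counts
HiddenSize = 16

print(gru_params(InputSize, HiddenSize))    # 1536
print(dense_params(HiddenSize, InputSize))  # 255
```

Note that MaxLen appears in neither formula, which is consistent with the same weights being shared across all 64 time steps.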

However, if I switch to a simple Dense layer:

inputs = keras.layers.Input(shape=(MaxLen, InputSize))
x = keras.layers.recurrent.GRU(HiddenSize, return_sequences=True)(inputs)
x = keras.layers.Dense(InputSize)(x)
predictions = keras.layers.Activation('softmax')(x)

I still only have 255 parameters:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 64, 15)            0         
_________________________________________________________________
gru_1 (GRU)                  (None, 64, 16)            1536      
_________________________________________________________________
dense_1 (Dense)              (None, 64, 15)            255       
_________________________________________________________________
activation_1 (Activation)    (None, 64, 15)            0         
=================================================================

I wonder if this is because Dense() will only use the last dimension in the shape, and effectively treat everything else as a batch-like dimension. But then I'm no longer sure what the difference is between Dense and TimeDistributed(Dense).

Update: Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py, it does seem that Dense uses only the last dimension to size itself:

def build(self, input_shape):
    assert len(input_shape) >= 2
    input_dim = input_shape[-1]

    self.kernel = self.add_weight(shape=(input_dim, self.units),

It also uses K.dot (the backend dot function) to apply the weights:

def call(self, inputs):
    output = K.dot(inputs, self.kernel)

The docs of K.dot imply that it works fine on n-dimensional tensors. I wonder if its exact behavior means that Dense() will in effect be applied at every time step. If so, the question still remains what TimeDistributed() achieves in this case.
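
The suspicion about the dot product can be checked outside Keras. The sketch below uses NumPy as a stand-in for the backend (an assumption: it mirrors the contraction over the last axis, not the actual Keras code path) and compares one n-dimensional dot against the same weights applied to each time step separately. The batch size of 4 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, MaxLen, HiddenSize, InputSize = 4, 64, 16, 15

h = rng.standard_normal((batch, MaxLen, HiddenSize))   # stand-in for the GRU output
kernel = rng.standard_normal((HiddenSize, InputSize))  # one shared (16, 15) weight matrix
bias = rng.standard_normal(InputSize)

# Contract over the last axis only; batch and time are both carried along.
all_at_once = np.dot(h, kernel) + bias

# Apply the same weights to each time step separately, then restack along time.
per_step = np.stack([h[:, t, :] @ kernel + bias for t in range(MaxLen)], axis=1)

print(all_at_once.shape)                   # (4, 64, 15)
print(np.allclose(all_at_once, per_step))  # True
```

If the Keras backend behaves like this, a plain Dense after a sequence-returning layer is already a per-time-step layer with shared weights.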

cseprog
  • Let me add that the two models behaved in almost exactly the same way during training. – cseprog Jun 18 '17 at 19:32
  • I have always been wondering about it too. So you confirmed that Dense() and TimeDistributed(Dense()) have the same performance in your case? I think a better design of the API would be allowing the users to set a parameter, whether to use the same Dense layer over timesteps or separate Dense layers for each timestep. – ymeng Aug 04 '17 at 00:55
  • In your case, Dense and TimeDistributed(Dense) give the same result, according to your update ("Looking at https://github.com/fchollet/keras/blob/master/keras/layers/core.py it does seem that Dense uses the last dimension only to size itself"). – Jindong Chen Apr 08 '18 at 01:46

2 Answers


TimeDistributed(Dense) applies the same Dense layer to every time step of the unrolled GRU/LSTM output, so the loss is computed between the predicted label sequence and the actual label sequence. This is normally what sequence-to-sequence labeling problems require.

With return_sequences=False, however, the Dense layer is applied only once, to the output of the last cell. This is normally the case when RNNs are used for classification. With return_sequences=True, the Dense layer is applied at every time step, just like TimeDistributed(Dense).

So your two models are equivalent. If you change the second model to return_sequences=False, Dense will be applied only at the last cell. Try it and the model will throw an error, because Y must then have size [Batch_size, InputSize]: the task is no longer sequence-to-sequence but full-sequence-to-label.

from keras.models import Sequential
from keras.layers import Dense, Activation, TimeDistributed
from keras.layers.recurrent import GRU
import numpy as np

InputSize = 15
MaxLen = 64
HiddenSize = 16

OutputSize = 8
n_samples = 1000

model1 = Sequential()
model1.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model1.add(TimeDistributed(Dense(OutputSize)))
model1.add(Activation('softmax'))
model1.compile(loss='categorical_crossentropy', optimizer='rmsprop')


model2 = Sequential()
model2.add(GRU(HiddenSize, return_sequences=True, input_shape=(MaxLen, InputSize)))
model2.add(Dense(OutputSize))
model2.add(Activation('softmax'))
model2.compile(loss='categorical_crossentropy', optimizer='rmsprop')

model3 = Sequential()
model3.add(GRU(HiddenSize, return_sequences=False, input_shape=(MaxLen, InputSize)))
model3.add(Dense(OutputSize))
model3.add(Activation('softmax'))
model3.compile(loss='categorical_crossentropy', optimizer='rmsprop')

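# Random inputs and targets, only to exercise the shapes.
X = np.random.random([n_samples, MaxLen, InputSize])
Y1 = np.random.random([n_samples, MaxLen, OutputSize])  # one target per time step
Y2 = np.random.random([n_samples, OutputSize])          # one target per sequence

# `epochs` replaces the Keras 1 argument `nb_epoch`.
model1.fit(X, Y1, batch_size=128, epochs=1)
model2.fit(X, Y1, batch_size=128, epochs=1)
model3.fit(X, Y2, batch_size=128, epochs=1)

# summary() prints the table itself and returns None.
model1.summary()
model2.summary()
model3.summary()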

In the example above, model1 and model2 have the same architecture (sequence-to-sequence models), and model3 is a full-sequence-to-label model.
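
To see why the wrapper is redundant for Dense yet still useful for other layers, here is a conceptual NumPy sketch. It is not the Keras implementation: `time_distributed`, `dense` and `per_image_mean` are hypothetical helpers, and the fold-time-into-batch approach is only one plausible way to apply a function to every temporal slice.

```python
import numpy as np


def time_distributed(fn, x):
    """Apply fn to every time slice of x, whose shape is (batch, time, ...).

    Hypothetical helper: folds the time axis into the batch axis, calls fn once,
    then unfolds the result again.
    """
    batch, steps = x.shape[:2]
    folded = x.reshape((batch * steps,) + x.shape[2:])
    out = fn(folded)
    return out.reshape((batch, steps) + out.shape[1:])


def dense(x, kernel, bias):
    # Works for any rank because it only contracts the last axis.
    return np.dot(x, kernel) + bias


def per_image_mean(images):
    # Stand-in for a layer that insists on (batch, height, width, channels) input.
    assert images.ndim == 4, "expects a batch of images"
    return images.mean(axis=(1, 2))


rng = np.random.default_rng(0)
kernel, bias = rng.standard_normal((16, 15)), rng.standard_normal(15)
h = rng.standard_normal((2, 64, 16))

# For a dense map the wrapper changes nothing.
same = np.allclose(time_distributed(lambda s: dense(s, kernel, bias), h),
                   dense(h, kernel, bias))
print(same)  # True

# For a rank-sensitive function the wrapper is what makes the call possible.
video = rng.standard_normal((2, 10, 8, 8, 3))  # (batch, frames, height, width, channels)
print(time_distributed(per_image_mean, video).shape)  # (2, 10, 3)
```

Calling `per_image_mean(video)` directly fails its rank check, which is the situation a layer such as Conv2D would be in when handed a sequence of images.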

Melike
mujjiga
  • Thank you for the answer. I'm not sure I can follow though as I know that the output in both cases is a sequence. In both cases the recurrent layer has return_sequences=True, and the output shape in both cases is 3D and is exactly the same (batch_size, 64, 15). So it seems to me that the Dense layer is also applied at every time step. – cseprog Jun 18 '17 at 19:31
  • I have updated my answer with better explanation, hope it helps you. – mujjiga Jun 19 '17 at 05:25
  • Thank you. For avoidance of doubt, when you say "So for as per your models both are same, but if u change your second model to "return_sequences=True" then the Dense will be applied only at the last cell." - do you mean if I change return_sequences to False? Your answer seems to imply that if return_sequences is True, Dense() and TimeDistributed(Dense()) do exactly the same thing. Could you confirm this? This would make sense, but then why does Keras need TimeDistributed() at all? – cseprog Jun 19 '17 at 10:15
  • Sorry for the typo, I have corrected it now; yes, it should be "return_sequences=False". As per the Keras documentation: "...This wrapper (TimeDistributed) allows to apply a layer to every temporal slice of an input." Using TimeDistributed you can apply the same layer, but RNNs are by default unrolled in time. See this code snippet: "model = Sequential() model.add(TimeDistributed(Conv2D(64, (3, 3)), input_shape=(10, 299, 299, 3)))". You can't achieve this using a simple Dense layer without the TimeDistributed wrapper. – mujjiga Jun 19 '17 at 12:01
  • Thanks again. Yes, I agree that one needs TimeDistributed() for other layer types. It seems to me that a simple Dense() after a recurrent layer that returns sequences does work, but more by accident than by design. From old Keras examples I think there used to be a TimeDistributedDense() - it's still a mystery to me why it was needed if Dense() would have worked anyway. – cseprog Jun 19 '17 at 15:13
  • Hi @thon, I've come to the same conclusion. It's also very strange because Dense() should flatten the input dimensions if > 2 as cited in the doc: "Note: if the input to the layer has a rank greater than 2, then it is flattened prior to the initial dot product with kernel". Have you found the answer? – Gengiolo Sep 22 '17 at 19:16
  • Hi, thanks for the comment. No, unfortunately I haven't managed to find out anything more about TimeDistributed vs Dense. – cseprog Sep 23 '17 at 21:06

Here is a piece of code that verifies that TimeDistributed(Dense(X)) is identical to Dense(X):

import numpy as np 
from keras.layers import Dense, TimeDistributed
import tensorflow as tf

X = np.array([ [[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]
               ],
               [[3, 1, 7],
                [8, 2, 5],
                [11, 10, 4],
                [9, 6, 12]
               ]
              ]).astype(np.float32)
print(X.shape)

(2, 4, 3)

dense_weights = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                          [0.2, 0.7, 0.9, 0.1, 0.2],
                          [0.1, 0.8, 0.6, 0.2, 0.4]])
bias = np.array([0.1, 0.3, 0.7, 0.8, 0.4])
print(dense_weights.shape)

(3, 5)

dense = Dense(input_dim=3, units=5, weights=[dense_weights, bias])
input_tensor = tf.Variable(X, name='inputX')
output_tensor1 = dense(input_tensor)
output_tensor2 = TimeDistributed(dense)(input_tensor)
print(output_tensor1.shape)
print(output_tensor2.shape)

(2, 4, 5)

(2, ?, 5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    output1 = sess.run(output_tensor1)
    output2 = sess.run(output_tensor2)

print(output1 - output2)

And the difference is:

[[[0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]]

 [[0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0.]]]
user263387
  • This is false; session graph results aren't equivalent to the full model graph results - latter involves gradients and weight updates. From [documentation](https://keras.io/layers/core/), _"if the input to the layer has a rank greater than 2, then it is flattened prior to the initial dot product with kernel"_ - contrasting `TimeDistributedDense`, which doesn't flatten. [Counterexample code](https://puu.sh/Edt3a/5f25208312.txt) – OverLordGoldDragon Sep 04 '19 at 22:40