6

I am trying to understand Keras layers better. I am working on a sequence-to-sequence model where I embed a sentence and pass it to an LSTM that returns sequences. After that, I want to apply a Dense layer to each timestep (word) in the sentence, and it seems like TimeDistributed does the job for three-dimensional tensors like this one.

In my understanding, Dense layers only work on two-dimensional tensors, and TimeDistributed just applies the same Dense layer to every timestep in three dimensions. Could one then not simply flatten the timesteps, apply a Dense layer, and reshape to obtain the same result, or are these not equivalent in some way that I am missing?

  • 1
    Then you would have a big dense layer with different parameters for each time step, instead of applying the same single-timestep dense layer to every timestep in the input. – jdehesa Dec 07 '18 at 13:38
  • I assume the dense-layer in some way has to be connected to every timestep in order to update weights on back-prop? I believe my failure to grasp the concept properly lies in the fact that I cannot visualize the approaches. – Andreas Olesen Dec 09 '18 at 21:35
  • As mentioned below by @Andrey Kite Gorin, Dense layers can be applied to 3D tensors and do exactly what you want to do. I think there were some earlier versions of Keras where you had to use TimeDistributed because Dense was only applicable to 2D tensors, which is why some tutorials out there still use it. – SaTa Feb 03 '19 at 21:23

4 Answers

7

Imagine your input has 4 time steps, each containing a 3-element vector. Let's represent that like this:

[Image: Input batch]

Now you want to transform this batch using a dense layer, so you get 5 features per time step. The output of the layer can be represented as something like this:

[Image: Output batch]

You are considering two options: a TimeDistributed dense layer, or flattening the input, applying a dense layer, and reshaping back into time steps.

In the first option, you would apply a dense layer with 3 inputs and 5 outputs to every single time step. That could look like this:

[Image: TimeDistributed layer]

Each blue circle here is a unit in the dense layer. By doing this with every input time step you get the total output. Importantly, these five units are the same for all the time steps, so you only have the parameters of a single dense layer with 3 inputs and 5 outputs.
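
A minimal Keras sketch of this first option, using the toy 4×3 dimensions from the drawing:

from tensorflow.keras.layers import Dense, TimeDistributed
from tensorflow.keras.models import Sequential

model = Sequential()
# The same Dense(5) is applied to each of the 4 time steps of 3 features
model.add(TimeDistributed(Dense(5), input_shape=(4, 3)))
model.summary()  # 3*5 + 5 = 20 parameters, output shape (None, 4, 5)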

The second option would involve flattening the input into a 12-element vector, applying a dense layer with 12 inputs and 20 outputs, and then reshaping that back. This is how it would look:

[Image: Flat dense layer]

Here the input connections of only one unit are drawn for clarity, but every unit would be connected to every input. Obviously, you have many more parameters here (those of a dense layer with 12 inputs and 20 outputs), and note that each output value is influenced by every input value, so values in one time step would affect outputs in other time steps. Whether this is good or bad depends on your problem and model, but it is an important difference from the previous option, where each time step's input and output were independent. Also note that this configuration requires a fixed number of time steps in every batch, whereas the previous option works independently of the number of time steps.
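
A sketch of this second option, again with the toy dimensions; compare the 260 parameters here with the 20 above:

from tensorflow.keras.layers import Dense, Flatten, Reshape
from tensorflow.keras.models import Sequential

model = Sequential()
model.add(Flatten(input_shape=(4, 3)))  # (None, 12)
model.add(Dense(20))                    # 12*20 + 20 = 260 parameters
model.add(Reshape((4, 5)))              # back to (None, 4, 5)
model.summary()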

You could also consider the option of having four dense layers, each applied independently to its own time step (I didn't draw it, but hopefully you get the idea). That would be similar to the previous option, except each unit would receive input connections only from its respective time step's inputs. I don't think there is a straightforward way to do that in Keras: you would have to split the input into four parts, apply dense layers to each part, and merge the outputs (one possible construction is sketched below). Again, in this case the number of time steps would be fixed.
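
One possible way to build that (my own construction, using the functional API to slice out each time step):

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, Lambda
from tensorflow.keras.models import Model

inputs = Input(shape=(4, 3))
# A separate Dense(5) per time step, each connected only to its own step
steps = [Dense(5)(Lambda(lambda x, t=t: x[:, t, :])(inputs)) for t in range(4)]
outputs = Lambda(lambda xs: tf.stack(xs, axis=1))(steps)  # (None, 4, 5)
model = Model(inputs, outputs)
model.summary()  # 4 * (3*5 + 5) = 80 parameters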

jdehesa
  • 58,456
  • 7
  • 77
  • 121
5

A Dense layer can act on any tensor, not necessarily rank 2. And I think that the TimeDistributed wrapper does not change anything in the way the Dense layer acts. Applying a Dense layer directly to a tensor of rank 3 does exactly the same as applying a TimeDistributed wrapper around the Dense layer. Here is an illustration:

from tensorflow.keras.layers import *
from tensorflow.keras.models import *

model = Sequential()

# Dense applied directly to a 3D input: (timesteps=50, features=10)
model.add(Dense(5, input_shape=(50, 10)))

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_5 (Dense)              (None, 50, 5)             55        
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________
model1 = Sequential()

# The same Dense wrapped in TimeDistributed: same output shape, same 55 params
model1.add(TimeDistributed(Dense(5), input_shape=(50, 10)))

model1.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
time_distributed_3 (TimeDist (None, 50, 5)             55        
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________
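
To check that the two models compute the same function, not just the same shapes, one can copy the weights across and compare outputs (a quick sketch building on the models above):

import numpy as np

# Both models hold a single (10, 5) kernel and a (5,) bias,
# so their weights are interchangeable
model1.set_weights(model.get_weights())

x = np.random.rand(2, 50, 10).astype("float32")
print(np.allclose(model.predict(x), model1.predict(x)))  # True
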
Andrey Kite Gorin
  • 1,030
  • 1
  • 9
  • 23
2

Adding to the above answers, here are a few pictures comparing the output shapes of the two layers. So using one of these layers after an LSTM (for example) can behave differently. [Image: output shape comparison]
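
Since the original image is not recoverable, here is a small sketch (mine, with illustrative dimensions) of where the two match and where they differ:

from tensorflow.keras.layers import LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Sequential

# With return_sequences=True both variants produce the same 3D shape
m1 = Sequential([LSTM(16, return_sequences=True, input_shape=(50, 10)),
                 Dense(5)])
m2 = Sequential([LSTM(16, return_sequences=True, input_shape=(50, 10)),
                 TimeDistributed(Dense(5))])
print(m1.output_shape)  # (None, 50, 5)
print(m2.output_shape)  # (None, 50, 5)

# With return_sequences=False the LSTM output is 2D: a plain Dense still
# applies, but TimeDistributed(Dense(5)) raises an error (needs 3D input)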

yuva-rajulu
  • 406
  • 5
  • 13
0

"Could one then not simply flatten the timesteps, apply a dense layer and perform a reshape to obtain the same result"

No, flattening the timesteps into the input dimension (input_dim) is the wrong operation. As illustrated by yuva-rajulu, if you flatten a 3D input (batch_size, timesteps, input_dim) = (1000, 50, 10), you end up with a flattened input (batch_size, input_dim) = (1000, 500), resulting in a network architecture where the timesteps interact with each other (see jdehesa's answer). This is not what is intended (i.e., we want to apply the same dense layer to each timestep independently).

What needs to be done instead is to reshape the 3D input to (batch_size * timesteps, input_dim) = (50000, 10), then apply the dense layer to this 2D input. That way the same dense layer operates independently on each of the 50,000 input vectors of size (10,). You end up with a (50000, n_units) output that you then reshape back to a (1000, 50, n_units) output. Fortunately, when you pass a 3D input to a Dense layer, Keras does this automatically for you. See the official reference:

"If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel (using tf.tensordot). For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units)."

Another way to see it is that Dense() computes the output simply by applying the kernel, i.e., the weight matrix of size (input_dim, n_units), to the last dimension of your 3D input, treating all other dimensions like batch dimensions, and then sizes the output accordingly.
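
A quick sketch of that equivalence (the shapes and n_units = 8 here are just illustrative):

import numpy as np
import tensorflow as tf

x = tf.random.normal((1000, 50, 10))
dense = tf.keras.layers.Dense(8)

y = dense(x)  # (1000, 50, 8): the (10, 8) kernel acts on the last axis

# The manual reshape described above gives the same result
flat = tf.reshape(x, (-1, 10))                     # (50000, 10)
y_manual = tf.reshape(dense(flat), (1000, 50, 8))
print(np.allclose(y.numpy(), y_manual.numpy(), atol=1e-5))  # True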

I think there may have been a time when the TimeDistributed layer was needed in Keras with Dense() (discussion here). Today, we do not need the TimeDistributed wrapper, as Dense() and TimeDistributed(Dense()) do exactly the same thing; see Andrey Kite Gorin or mujjiga.

Alexis
  • 1,343
  • 12
  • 15