
This is my test code:

from keras import layers

input1 = layers.Input((2, 3))      # symbolic input of shape (batch, 2, 3)
output = layers.Dense(4)(input1)   # Dense layer with 4 units
print(output)

The output is:

<tf.Tensor 'dense_2/add:0' shape=(?, 2, 4) dtype=float32>

But what happened here?

The documentation says:

Note: if the input to the layer has a rank greater than 2, then it is flattened prior to the initial dot product with kernel.

Yet the output above clearly is not flattened; only the last axis has changed. Why?

  • The documentation surprises me. I always thought the Dense layer would be applied to the last axis while leaving the other axes intact. – dennis-w Aug 30 '18 at 06:49

1 Answer


Currently, contrary to what is stated in the documentation, the Dense layer is applied on the last axis of the input tensor:

Contrary to the documentation, we don't actually flatten it. It's applied on the last axis independently.

In other words, if a Dense layer with m units is applied to an input tensor of shape (n_dim1, n_dim2, ..., n_dimk), the output has shape (n_dim1, n_dim2, ..., n_dim(k-1), m): every axis except the last is left untouched, and the last axis is mapped from n_dimk to m.
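
For example (a minimal sketch with arbitrarily chosen dimensions), a rank-4 input keeps all of its leading axes and only the last one is transformed:

from keras import layers

x = layers.Input((10, 20, 5))   # rank-4 tensor once the batch axis is included
y = layers.Dense(7)(x)          # only the last axis (5) is transformed to 7
print(y)                        # shape=(?, 10, 20, 7)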


As a side note: this makes TimeDistributed(Dense(...)) and Dense(...) equivalent to each other.
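
To see the equivalence, here is a small sketch (layer sizes are arbitrary) comparing the two on the same input:

from keras import layers

inp = layers.Input((2, 3))
dense_out = layers.Dense(4)(inp)                        # shape (?, 2, 4)
td_out = layers.TimeDistributed(layers.Dense(4))(inp)   # shape (?, 2, 4)
print(dense_out)
print(td_out)   # both apply the same per-row (per-timestep) transformation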


Another side note: be aware that this implies weight sharing; the same kernel is reused for every position along the leading axes. For example, consider this toy network:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(10, input_shape=(20, 5)))   # 10 units applied to the last axis (size 5)

model.summary()

The model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 20, 10)            60        
=================================================================
Total params: 60
Trainable params: 60
Non-trainable params: 0
_________________________________________________________________

As you can see, the Dense layer has only 60 parameters. How? Each of the 10 units is connected to the 5 elements of every row of the input with the same weights, therefore 10 * 5 (kernel weights) + 10 (one bias per unit) = 60.
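
To verify the sharing, you can apply the layer's kernel and bias to the input manually (a sketch assuming numpy and the toy model above; note the kernel has shape (5, 10) and is reused for all 20 rows):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(10, input_shape=(20, 5)))

x = np.random.rand(1, 20, 5).astype('float32')
w, b = model.layers[0].get_weights()     # w: (5, 10), b: (10,)
manual = x.dot(w) + b                    # same kernel applied to each of the 20 rows
print(np.allclose(model.predict(x), manual, atol=1e-5))   # True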


Update. Here is a visual illustration of the example above:

[Figure: visual illustration of applying a Dense layer to an input with two or more dimensions in Keras]

  • Could someone draw me a picture? As a newcomer to the library, it's tricky to wrap my head around precisely where the connections and shared weights are. – Zero Mar 14 '19 at 05:27
  • Sorry to revive this topic, as I don't think I should ask a new question for this. How does the shared weight effect impact the performance? And how does it impact the performance in reinforcement learning? – Pedro Pablo Severin Honorato Aug 05 '20 at 19:04
  • @PedroPabloSeverinHonorato That's a very broad question and the answer entirely depends on the specific problem as well as the architecture of the model. In general though, we can say that weight sharing decreases the number of parameters; this in turn makes the model smaller and therefore may speed up training/testing of the model. However, there is no guarantee that weight sharing would always increase the accuracy of the model as well. There are various ways and patterns of weight sharing which may or may not work or be beneficial in a specific problem instance or model. – today Aug 05 '20 at 19:13