
First, I have read this and this question, both with names similar to mine, and I still do not have an answer.

I want to build a feedforward network for sequence prediction. (I realize that RNNs are more suitable for this task, but I have my reasons.) The sequences are of length 128 and each element is a vector with 2 entries, so each batch should be of shape (batch_size, 128, 2). The target is the next step in the sequence, so the target tensor should be of shape (batch_size, 1, 2).
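
For concreteness, a small sketch of dummy arrays with these shapes (the values are made up, only the shapes matter):

    import numpy as np

    batch_size = 32                           # arbitrary
    X = np.random.rand(batch_size, 128, 2)    # 128 steps, 2 entries each
    y = np.random.rand(batch_size, 1, 2)      # the next step of each sequence
    print(X.shape, y.shape)                   # (32, 128, 2) (32, 1, 2)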

The network architecture is something like this:

    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    model.add(Dense(50, batch_input_shape=(None, 128, 2), kernel_initializer="he_normal", activation="relu"))
    model.add(Dense(20, kernel_initializer="he_normal", activation="relu"))
    model.add(Dense(5, kernel_initializer="he_normal", activation="relu"))
    model.add(Dense(2))

But trying to train I get the following error:

ValueError: Error when checking target: expected dense_4 to have shape (128, 2) but got array with shape (1, 2)
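
A minimal sketch of a training call that triggers this error, assuming the model defined above, the dummy arrays from the sketch above, and a placeholder optimizer/loss (not specified in the question):

    # X: (32, 128, 2), y: (32, 1, 2) from the sketch above; "adam"/"mse" are placeholders
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y)   # ValueError: Error when checking target: expected dense_4 ...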

I've tried variations like:

    model.add(Dense(50, input_shape=(128, 2), kernel_initializer="he_normal", activation="relu"))

but get the same error.

  • I don't know the task but there are several ways to arrange feedforward nns with (128,2) as input and (1,2) as output. Maybe you can explain why you think it is a sequence of 128 with 2 vectors at a time instead of a sequence of 2 with 128 vectors at a time? – Mehdi Oct 06 '18 at 17:41
  • Since these are time series with 128 steps (each step a 2-entry vector) each and I'd like to retain the temporal relation – H.Rappeport Oct 06 '18 at 17:51
  • It seems you are looking for the `TimeDistributed` wrapper, but your answer confuses me. If there are 128 steps in time and 2 vectors at each time, what is the dimensionality of the vectors? And where in your network do you want to collapse the time dimension into 1 step? Do these two entries compose with each other, or do they need to be kept separately? – Mehdi Oct 06 '18 at 17:56
  • @Mehdi Thank you for mentioning TimeDistributed, I was not familiar with this wrapper. Perhaps my wording was problematic; I meant that each sequence is of the shape [[x1_1, x1_2], [x2_1, x2_2], [x3_1, x3_2], ...124 steps..., [x128_1, x128_2]], and x_1 and x_2 are independent. – H.Rappeport Oct 06 '18 at 18:26
    @Mehdi @H.Rappeport Just a side note: as I mentioned in my answer, since [Dense layer is applied on the last axis](https://stackoverflow.com/a/52092176/2099607), there is no difference between `TimeDistributed(Dense(...))` and `Dense(...)`. – today Oct 06 '18 at 19:27

1 Answer


If you take a look at the `model.summary()` output, you will see what the issue is:

Layer (type)                 Output Shape              Param #   
=================================================================
dense_13 (Dense)             (None, 128, 50)           150       
_________________________________________________________________
dense_14 (Dense)             (None, 128, 20)           1020      
_________________________________________________________________
dense_15 (Dense)             (None, 128, 5)            105       
_________________________________________________________________
dense_16 (Dense)             (None, 128, 2)            12        
=================================================================
Total params: 1,287
Trainable params: 1,287
Non-trainable params: 0
_________________________________________________________________

As you can see, the output of the model is (None, 128, 2) and not (None, 1, 2) (or (None, 2)) as you expected. You may or may not know this, but the Dense layer is applied on the last axis of its input array; as a result, as you can see above, the time axis and its dimension of 128 are preserved all the way to the end.
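
A quick way to see this in practice (and the `TimeDistributed` point from the comments above) is to compare a bare Dense with a TimeDistributed(Dense) on the same 3D input. This is just an illustrative sketch:

from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# Both models report the same output shape, (None, 128, 50), and the same
# parameter count (2*50 + 50 = 150), because Dense already acts on the last axis only.
plain = Sequential([Dense(50, input_shape=(128, 2))])
wrapped = Sequential([TimeDistributed(Dense(50), input_shape=(128, 2))])

plain.summary()
wrapped.summary()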

How to resolve this? You mentioned that you don't want to use an RNN layer, so you have two options: either use a Flatten layer somewhere in the model, or use some Conv1D + Pooling1D layers (or even a GlobalPooling layer). For example (these are just for demonstration; you may do it differently):

Using a Flatten layer

from keras import models
from keras.layers import Dense, Flatten

model = models.Sequential()
model.add(Dense(50, batch_input_shape=(None, 128, 2), kernel_initializer="he_normal", activation="relu"))
model.add(Dense(20, kernel_initializer="he_normal", activation="relu"))
model.add(Dense(5, kernel_initializer="he_normal", activation="relu"))
model.add(Flatten())
model.add(Dense(2))

model.summary()

Model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_17 (Dense)             (None, 128, 50)           150       
_________________________________________________________________
dense_18 (Dense)             (None, 128, 20)           1020      
_________________________________________________________________
dense_19 (Dense)             (None, 128, 5)            105       
_________________________________________________________________
flatten_1 (Flatten)          (None, 640)               0         
_________________________________________________________________
dense_20 (Dense)             (None, 2)                 1282      
=================================================================
Total params: 2,557
Trainable params: 2,557
Non-trainable params: 0
_________________________________________________________________

Using a GlobalAveragePooling1D layer

from keras import models
from keras.layers import Dense, GlobalAveragePooling1D

model = models.Sequential()
model.add(Dense(50, batch_input_shape=(None, 128, 2), kernel_initializer="he_normal", activation="relu"))
model.add(Dense(20, kernel_initializer="he_normal", activation="relu"))
model.add(GlobalAveragePooling1D())
model.add(Dense(5, kernel_initializer="he_normal", activation="relu"))
model.add(Dense(2))

model.summary()

Model summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_21 (Dense)             (None, 128, 50)           150       
_________________________________________________________________
dense_22 (Dense)             (None, 128, 20)           1020      
_________________________________________________________________
global_average_pooling1d_2 ( (None, 20)                0         
_________________________________________________________________
dense_23 (Dense)             (None, 5)                 105       
_________________________________________________________________
dense_24 (Dense)             (None, 2)                 12        
=================================================================
Total params: 1,287
Trainable params: 1,287
Non-trainable params: 0
_________________________________________________________________

Note that in both cases above you need to reshape the labels (i.e. targets) array to (n_samples, 2) (or you may want to use a Reshape layer at the end).
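
For that reshape, something like `y = y.reshape((-1, 2))` (or `np.squeeze(y, axis=1)`) on the target array would do. And since Conv1D + pooling layers were mentioned above as another option but not shown, here is a rough sketch along those lines (the filter count, kernel size, and pooling choice are arbitrary, just for illustration):

from keras import models
from keras.layers import Conv1D, Dense, GlobalMaxPooling1D

model = models.Sequential()
model.add(Conv1D(32, kernel_size=5, activation="relu", input_shape=(128, 2)))
model.add(Conv1D(32, kernel_size=5, activation="relu"))
model.add(GlobalMaxPooling1D())   # collapses the time axis -> (None, 32)
model.add(Dense(2))

model.summary()

# The targets again need shape (n_samples, 2), e.g. y = y.reshape((-1, 2))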

  • Thank you. Follow-up question (which may or may not deserve a separate question): what is the effect of the location of the Flatten layer? Would it make any difference if it were placed before the first layer or the last? – H.Rappeport Oct 06 '18 at 18:29
  • @H.Rappeport Of course it does. It changes the connections. As I mentioned in my answer, the `Dense` layer is applied on the last axis and the weights **are shared (i.e. the same weights are applied)**. To clarify this, take a look at the second Dense layer in the model summaries above. You see that it has 1020 parameters: 50 parameters for each of its 20 units (50 * 20 = 1000), where each unit is connected to a row of 50 elements in the previous layer's output, plus one bias parameter per unit (20). Now put a Flatten layer before this layer: the number of parameters would be 128*50*20+20. Totally different! (A quick check is sketched below.) – today Oct 06 '18 at 19:23
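
A quick, illustrative way to check the two parameter counts mentioned in the comment above (the layer arrangement here is hypothetical, only the counts matter):

from keras.models import Sequential
from keras.layers import Dense, Flatten

# Without Flatten: the second Dense sees (128, 50) and acts only on the last
# axis, so it has 50*20 + 20 = 1,020 parameters (shared across the 128 steps).
shared = Sequential([Dense(50, input_shape=(128, 2)), Dense(20)])
print(shared.layers[1].count_params())   # 1020

# With Flatten in between: the second Dense sees a single 128*50 = 6400 vector,
# so it has 6400*20 + 20 = 128,020 parameters.
flat = Sequential([Dense(50, input_shape=(128, 2)), Flatten(), Dense(20)])
print(flat.layers[2].count_params())     # 128020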