First of all, the shapes of the tensors in the inception layer are not what you define. `1x1`, `1x3` and `3x1` are the shapes of the filters applied to the image. Convolution has two more parameters, padding and striding, and depending on their exact values the resulting shape can be very different. In this particular case, the spatial shape doesn't change; only the channels dimension will be 2048 and 256, which is why the outputs can be concatenated. Concatenating your original `t1` and `t2` will result in an error.
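For instance, here is a minimal sketch in TensorFlow (assuming a hypothetical 8x8 input with 3 channels, batch size 1, and stride 1) showing why `SAME` padding preserves the spatial shape, so that only the channels axes differ:

    import tensorflow as tf

    # Hypothetical input: batch of 1, 8x8 spatial, 3 channels.
    x = tf.random.normal([1, 8, 8, 3])

    # Filter shapes are [height, width, in_channels, out_channels].
    w_1x1 = tf.random.normal([1, 1, 3, 2048])
    w_3x1 = tf.random.normal([3, 1, 3, 256])

    # With padding='SAME' and stride 1, the spatial shape is preserved.
    a = tf.nn.conv2d(x, w_1x1, strides=1, padding='SAME')  # (1, 8, 8, 2048)
    b = tf.nn.conv2d(x, w_3x1, strides=1, padding='SAME')  # (1, 8, 8, 256)

    # Only the channels axis differs, so concatenation along axis 3 works.
    c = tf.concat([a, b], axis=3)
    print(c.shape)  # (1, 8, 8, 2304)

With `VALID` padding or a stride greater than 1, the two branches could end up with different spatial shapes and the concatenation would fail, which is the error you're seeing with `t1` and `t2`.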
> Is this the correct way to implement a concatenation layer?
Yes, feature map concatenation is one of the key ideas of the inception network, and its implementation indeed uses `tf.concat` (e.g. see the inception v1 source code).
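A sketch of that pattern with the Keras functional API (the branch sizes here are hypothetical, not the exact ones from the paper): each branch transforms the same input, and the branch outputs are merged with `tf.concat` along the channels axis.

    import tensorflow as tf

    inputs = tf.keras.Input(shape=(28, 28, 192))
    # Parallel branches over the same input (hypothetical filter counts).
    branch_a = tf.keras.layers.Conv2D(64, (1, 1), padding='same')(inputs)
    branch_b = tf.keras.layers.Conv2D(96, (3, 3), padding='same')(inputs)
    branch_c = tf.keras.layers.MaxPooling2D((3, 3), strides=1,
                                            padding='same')(inputs)
    # Merge along the channels axis; channel counts add up: 64 + 96 + 192.
    outputs = tf.concat([branch_a, branch_b, branch_c], axis=3)
    model = tf.keras.Model(inputs, outputs)
    print(model.output_shape)  # (None, 28, 28, 352)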
Note that this tensor will grow in one direction (channels / features), but contract in the spatial dimensions because of downsampling, so it won't get too large. Also note that this tensor is the transformed input data (the image); hence, unlike the weights, it's not initialized but rather flows through the network. The weights will be the tensors `1x1x2048 = 2048`, `1x3x224 = 672`, `3x1x256 = 768`, etc. As you can see, they are not very big at all, and that's another key idea of the inception network.
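A quick check of those weight counts (note that a real convolution weight tensor also has an input-channels dimension, which these figures omit):

    # Each entry is height x width x out_channels, as quoted above.
    for h, w, c in [(1, 1, 2048), (1, 3, 224), (3, 1, 256)]:
        print(f"{h}x{w}x{c} = {h * w * c}")
    # 1x1x2048 = 2048
    # 1x3x224 = 672
    # 3x1x256 = 768

Keeping the individual filters small (1x1, 1x3, 3x1 instead of large square kernels) is what keeps the parameter count low even as the channels dimension grows.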