
I've recently read the RetinaNet paper, and there's one minor detail I have yet to understand:
We have the multi-scale feature maps obtained from the FPN (P2, ..., P7).
The two FCN heads (the classification head and the regression head) then convolve each of these feature maps.
However, each feature map has a different spatial size, so how do the classification and regression heads maintain fixed output volumes, given that all their convolution parameters are fixed (i.e. 3x3 filters with stride 1, etc.)?

Looking at this line in PyTorch's implementation of RetinaNet, I see that the heads just convolve each feature map and then all the outputs are stacked somehow (the only dimension they share is the channel dimension, which is 256; spatially, each level is double the size of the next).
I'd love to hear how they are combined; I wasn't able to understand that point.
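
To make the setup concrete, here's a toy sketch of what I mean (the sizes and names are just mine for illustration, not from the paper or from torchvision): a single shared 3x3 conv head applied to feature maps of different spatial sizes produces outputs of different spatial sizes.

```python
import torch
import torch.nn as nn

# Toy shared classification head: 3x3 conv, stride 1, padding 1,
# so spatial size is preserved per level (but differs across levels).
num_anchors, num_classes, channels = 9, 80, 256
cls_head = nn.Conv2d(channels, num_anchors * num_classes, kernel_size=3, padding=1)

# Fake FPN outputs: same channel dim (256), spatial size halving at each level.
feature_maps = [torch.randn(1, channels, s, s) for s in (64, 32, 16, 8, 4)]

for i, f in enumerate(feature_maps):
    out = cls_head(f)
    print(f"level {i}: input {tuple(f.shape)} -> output {tuple(out.shape)}")
# Each output has shape (1, num_anchors * num_classes, H_i, W_i),
# with H_i, W_i differing per level -- hence my question of how they are combined.
```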

Jjang

1 Answer


After the convolution at each pyramid level, you reshape the outputs to shape (H*W, out_dim) (with out_dim being num_classes * num_anchors for the class head and 4 * num_anchors for the bbox regressor). Finally, you can concatenate the resulting tensors along the H*W dimension, which is now possible because all the other dimensions match, and compute losses as you would on a network with a single feature layer.
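
As a rough illustration of that reshape-and-concatenate step for the classification head (a sketch based on this answer, with assumed shapes and names, not code taken from any particular implementation):

```python
import torch

num_anchors, num_classes = 9, 80
batch = 2

# Per-level head outputs: (N, num_anchors * num_classes, H_i, W_i),
# where H_i and W_i differ from level to level.
head_outputs = [torch.randn(batch, num_anchors * num_classes, s, s)
                for s in (64, 32, 16, 8, 4)]

flattened = []
for out in head_outputs:
    n, _, h, w = out.shape
    # (N, A*K, H, W) -> (N, H, W, A*K) -> (N, H*W*A, K)
    out = out.permute(0, 2, 3, 1).reshape(n, h * w * num_anchors, num_classes)
    flattened.append(out)

# Concatenate along the "all anchors across all levels" dimension;
# every other dimension now matches across levels.
cls_logits = torch.cat(flattened, dim=1)
print(cls_logits.shape)  # (2, (64*64 + 32*32 + 16*16 + 8*8 + 4*4) * 9, 80)
```

The box-regression head works the same way, with 4 in place of num_classes.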

GPhilo
  • Thanks, that makes more sense. However, how is that reshaping being done? Where can I see that this step actually occurs (in the paper / in some implementation)? – Jjang May 12 '20 at 20:16
  • In the paper it's not explicitly mentioned, as it's a mere implementation detail. In principle, you could also avoid the concatenation and just calculate the loss separately for each pyramid level, adding everything up at the end (which, as far as I can tell, is the approach used by the `keras-retinanet` implementation). As for how it's done, it's a simple `Reshape` op. I think TF's implementation of SSD+FPN does the reshaping and concatenation; you might have a look at their code (which, however, is quite complex to read because of how the object detection API is arranged) – GPhilo May 15 '20 at 14:03