
I am reading through the Residual Learning paper, and I have a question. What is the "linear projection" mentioned in section 3.2? It looks pretty simple once you get it, but I could not grasp the idea...

Can someone provide simple example?

Troy
    I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut May 05 '22 at 21:10

3 Answers


First up, it's important to understand what x, y and F are and why they need any projection at all. I'll try to explain in simple terms, but a basic understanding of ConvNets is required.

x is the input data (a tensor) of the layer; in the case of ConvNets its rank is 4. You can think of it as a 4-dimensional array. F is usually a conv layer (conv+relu+batchnorm in this paper), and y combines the two together (forming the output channel). The result of F is also of rank 4, and most of its dimensions will be the same as in x, except for one. That's exactly what the transformation should patch.

For example, x's shape might be (64, 32, 32, 3), where 64 is the batch size, 32x32 is the image size and 3 stands for the (R, G, B) color channels. F(x) might be (64, 32, 32, 16): the batch size never changes and, for simplicity, a ResNet conv layer doesn't change the image size either, but it will likely use a different number of filters, here 16.

So, in order for y=F(x)+x to be a valid operation, x must be "reshaped" from (64, 32, 32, 3) to (64, 32, 32, 16).

I'd like to stress here that "reshaping" here is not what numpy.reshape does.

Instead, the last (channel) dimension of x is padded with 13 zeros, like this:

pad(x=[1, 2, 3],padding=[7, 6]) = [0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 0, 0, 0, 0, 0, 0]

If you think about it, this is a projection of a 3-dimensional vector onto 16 dimensions. In other words, we start to think that our vector is the same, but there are 13 more dimensions out there. None of the other x dimensions are changed.
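The padding-as-projection idea can be sketched with NumPy (the shapes here are illustrative, not from the paper):

```python
import numpy as np

# Toy input in NHWC layout: batch of 2, 4x4 images, 3 channels
x = np.random.rand(2, 4, 4, 3)

# "Project" the channel dimension from 3 to 16 by zero-padding
# only the last axis (7 zeros before, 6 after), leaving the
# batch and spatial dimensions untouched.
x_proj = np.pad(x, [(0, 0), (0, 0), (0, 0), (7, 6)])

print(x_proj.shape)                       # (2, 4, 4, 16)
print(np.allclose(x_proj[..., 7:10], x))  # True: original channels preserved
```

Now y = F(x) + x_proj is a valid element-wise addition.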

Here's the link to the code in Tensorflow that does this.

Maxim
  • Thank you very much! I have mainly used MATLAB rather than Python, so there might be a misunderstanding, I guess. In MATLAB the last dimension is the number of images; in Python the first dimension is. – Troy Sep 21 '17 at 14:31
  • Got you. The order may be different, but the projection should be done as described in the answer. – Maxim Sep 21 '17 at 14:46
  • Please disregard above one. – Troy Sep 21 '17 at 14:50
  • Ah... again... habitually punched enter. Here is my actual question: thank you very much, but I still have questions; I blame my clumsiness. My first question is about the width and height reduction in a residual connection. For example, they used a stride of 2, which leads to a width and height reduction rather than a change in the number of filters. That was my first question. After your lesson, I realized a different number of channels also causes a problem. Could you give me another lesson covering both, i.e., a different (width and height) and a different channel count in a residual connection? – Troy Sep 21 '17 at 14:50
  • Good question, but not enough space here to fully answer it. In short: when the layer downsamples the image (by using `strides=2`), `x` goes through a pooling layer **as well** with the same stride. So both `F(x)` and `x` reduce the size of the image by half, and just like before only the "channel" dimension needs to be projected. I could only find an example in python: https://github.com/tflearn/tflearn/blob/master/examples/images/residual_network_mnist.py You can see two layers with `downsample=True`, both of which scale down the image. – Maxim Sep 21 '17 at 15:45
  • The link to tensorflow code wasn't anchored to a specific commit so I believe it is now pointing to the wrong line of code as a result of changes on the master branch – Xander Dunn Dec 06 '20 at 19:55

A linear projection is one where each new feature is simply a weighted sum of the original features. As in the paper, this can be represented by matrix multiplication. If x is the vector of N input features and W is an M-by-N matrix, then the matrix product Wx yields M new features, where each one is a linear projection of x. Each row of W is a set of weights that defines one of the M linear projections (i.e., each row of W contains the coefficients for one of the weighted sums of x).
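A minimal sketch in NumPy (the names x and W are from the answer above; the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # N = 3 input features
W = np.arange(15.0).reshape(5, 3)  # M-by-N weight matrix with M = 5

y = W @ x                          # each entry of y is a weighted sum of x
print(y.shape)           # (5,)
print(y[0] == W[0] @ x)  # True: row 0 of W defines the first projection
```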

bogatron
  • Thank you for your kind explanations. Please confirm that I understand correctly: if an input x is 3x3 and we want to project it to 4x4, then we vectorize x [3x3] to [9x1], and W will be [16x9]. Therefore W [16x9] x [9x1] = [16x1], and we reshape it to [4x4]. Is this what you explained? – Troy Sep 09 '17 at 20:33
  • Yes, you got it. – bogatron Sep 09 '17 at 21:12
  • @W.Choi this answer is technically correct, but a bit misleading, as can be seen by your comment. Please see my answer. – Maxim Sep 20 '17 at 13:34

In PyTorch (in particular torchvision/models/resnet.py), at the end of a Bottleneck you will have one of two scenarios:

  1. The input vector x's channels, say x_c (not the spatial resolution, but the channels), do not match the output of layer conv3 of the Bottleneck, say d dimensions. This is handled by a 1-by-1 convolution with in_planes = x_c and out_planes = d, with stride 1, followed by batch normalization; then the addition F(x) + x occurs, assuming x and F(x) have the same spatial resolution.

  2. Both the spatial resolution of x and its number of channels don't match the output of the Bottleneck, in which case the 1-by-1 convolution mentioned above needs stride 2 so that both the spatial resolution and the number of channels match for the element-wise addition (again with batch normalization of x before the addition).
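The second scenario can be sketched in a few lines of PyTorch (the shapes here are hypothetical, but the Sequential mirrors the shape of the downsample branch in torchvision's resnet.py):

```python
import torch
import torch.nn as nn

# Hypothetical shortcut: map 64 channels at 56x56
# to 256 channels at 28x28 (channels AND resolution change).
x = torch.randn(1, 64, 56, 56)

downsample = nn.Sequential(
    nn.Conv2d(64, 256, kernel_size=1, stride=2, bias=False),  # 1x1 conv, stride 2
    nn.BatchNorm2d(256),
)

shortcut = downsample(x)
print(shortcut.shape)  # torch.Size([1, 256, 28, 28])
```

The result can now be added element-wise to the Bottleneck output F(x).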

IntegrateThis