
I've looked everywhere and can't find anything that explains the actual derivation of backprop for residual layers. Here's my best attempt and where I'm stuck. It's worth mentioning that the derivation I'm hoping for is a generic one, not limited to convolutional NNs.

If the formula for calculating the output of a normal hidden layer is F(x), then the formula for a hidden layer with a residual connection is F(x) + o, where x is the weight-adjusted output of a previous layer, o is the output of a previous layer, and F is the activation function. To get the delta for a normal layer during back-propagation, one needs to calculate the gradient of the output, ∂F(x)/∂x. For a residual layer this is ∂(F(x) + o)/∂x, which separates into ∂F(x)/∂x + ∂o/∂x (1).

If all of this is correct, how does one deal with ∂o/∂x? It seems to me that it depends on how far back in the network o comes from.

  • If o is just from the previous layer, then o*w = x, where w are the weights connecting the previous layer to the layer for F(x). Taking the derivative of each side with respect to o gives ∂(o*w)/∂o = ∂x/∂o, and the result is w = ∂x/∂o, which is just the inverse of the term that comes out at (1) above. Does it make sense that in this case the gradient of the residual layer is just ∂F(x)/∂x + 1/w? Is it accurate to interpret 1/w as a matrix inverse? If so, is that actually getting computed by NN frameworks that use residual connections, or is there some shortcut for adding in the error from the residual?

  • If o is from further back in the network then, I think, the derivation becomes slightly more complicated. Here is an example where the residual comes from one layer further back in the network. The architecture is Input--w1--L1--w2--L2--w3--L3--Out, with a residual connection from L1 to L3. The symbol o from the first example is replaced by the layer output L1 to avoid ambiguity. We are trying to calculate the gradient at L3 during back-prop, where L3 has a forward function of F(x)+L1 with x = F(L1*w2)*w3. The derivative of this relationship is ∂x/∂L1 = ∂(F(L1*w2)*w3)/∂L1, which is more complicated but doesn't seem too difficult to solve numerically. (I give a numerical check of this toy setup in the EDIT below.)

If the above derivation is reasonable then it's worth noting that there is a case where it fails: when a residual connection originates from the Input layer. This is because the input cannot be broken down into an o*w=x expression (where x would be the input values). I think this would suggest that residual connections cannot originate from the input layer, but since I've seen network architecture diagrams that have residual connections originating from the input, this casts my derivation above into doubt. I can't see where I've gone wrong, though. If anyone can provide a derivation or code sample for how they calculate the gradient at residual merge points correctly, I would be deeply grateful.

EDIT:

The core of my question is, when using residual layers and doing vanilla back-propagation, is there any special treatment of the error at the layers where residuals are added? Since there is a 'connection' between the layer where the residual comes from and the layer where it is added, does the error need to get distributed backwards over this 'connection'? My thinking is that since residual layers provide raw information from the beginning of the network to deeper layers, the deeper layers should provide raw error to the earlier layers.

Based on what I've seen (reading the first few pages of Googleable forum results, reading the essential papers, and watching video lectures), and on Maxim's answer below, I'm starting to think that the answer is that ∂o/∂x = 0 and that we treat o as a constant.

Does anyone do anything special during back-prop through a NN with residual layers? If not, then does that mean residual layers are an 'active' part of the network on only the forward pass?
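
To sanity-check the "nothing special" idea, here is a quick numerical experiment on the toy network from my second bullet above. It's a plain-numpy sketch, and the names w1/w2/w3, the tanh activation, and the squared-error loss are just my own choices for illustration, not anything from a particular framework. The hand-written backward pass treats the merge the ordinary chain-rule way: the error arriving at L3 is added onto L1's error as-is, with no inverse weight matrices anywhere, and it matches a finite-difference estimate of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x0 = rng.normal(size=(1, d))                  # network input
w1, w2, w3 = (rng.normal(size=(d, d)) for _ in range(3))

f = np.tanh

def df(a):
    return 1.0 - np.tanh(a) ** 2              # derivative of tanh

def forward(w1, w2, w3):
    a1 = x0 @ w1
    L1 = f(a1)
    a2 = L1 @ w2
    L2 = f(a2)
    a3 = L2 @ w3
    L3 = f(a3) + L1                           # residual merge: L1 is added as-is
    loss = 0.5 * np.sum(L3 ** 2)
    return loss, (a1, L1, a2, L2, a3, L3)

loss, (a1, L1, a2, L2, a3, L3) = forward(w1, w2, w3)

# Backward pass written out by hand.
dL3 = L3                                      # d(loss)/d(L3) for the squared-error loss
da3 = dL3 * df(a3)                            # through layer 3's activation
dL2 = da3 @ w3.T
da2 = dL2 * df(a2)
dL1 = da2 @ w2.T + dL3                        # main path + skip: the error dL3 is
                                              # added unchanged, no inverse of w2 or w3
da1 = dL1 * df(a1)
dw1 = x0.T @ da1

# Finite-difference check of dw1; it only matches if the "+ dL3" term above
# is the right treatment of the merge point.
eps = 1e-6
num = np.zeros_like(w1)
for i in range(d):
    for j in range(d):
        wp, wm = w1.copy(), w1.copy()
        wp[i, j] += eps
        wm[i, j] -= eps
        num[i, j] = (forward(wp, w2, w3)[0] - forward(wm, w2, w3)[0]) / (2 * eps)

print(np.max(np.abs(dw1 - num)))              # prints a tiny number: analytic and numeric agree
```

Dropping the "+ dL3" term (or replacing it with anything involving an inverse of w2) makes the check fail, which is consistent with the "nothing special happens at the merge" reading.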

Jacob Statnekov
  • I see you've updated the question. Can you say what exactly you mean by `o`? – Maxim Oct 19 '17 at 13:58
  • When I define the relationship o*w=x, I say that o is from the previous layer. For clarity, I really should have said that o is the *output* from the previous layer. – Jacob Statnekov Oct 19 '17 at 23:17
  • Ok, in the paper and in my answer, `x` is the output of the previous layer. What's the meaning of ∂o/∂x then? – Maxim Oct 20 '17 at 13:18
  • The value/meaning of ∂o/∂x is precisely what my question is asking for. It appears in the backprop equation when taking the output gradient of a residual layer. There don't appear to be any worked example equations for residual backprop or else it would necessarily need to be addressed there. I'm sure someone has had to figure this out at some point, but I can't find anywhere that shows the derivation for how to generally handle the output gradient for residual layers. BTW, I appreciate you engaging with me on this. – Jacob Statnekov Oct 21 '17 at 05:32
  • I think you mixed up backprop gradients. ∂o/∂x is never calculated, but ∂x/∂o is. Read through [the math](https://medium.com/@erikhallstrm/backpropagation-from-the-beginning-77356edf427d) again and take note: there are only two types of partial derivatives - loss function wrt variable; function output wrt the input. Partial derivative other way around doesn't make sense. Residual connections are no exception. – Maxim Oct 22 '17 at 15:21
  • Thank you for the link Maxim, unfortunately it only goes over basic backprop, which I am already quite familiar with. I clearly show how ∂o/∂x appears while taking the gradient of a residual layer. I feel like a careful re-reading of my question should clarify things for you. – Jacob Statnekov Oct 22 '17 at 23:55
  • 1
    OK, just to make sure, I've re-read the question. See the update to my answer. – Maxim Oct 23 '17 at 12:01

1 Answer


I think you've over-complicated residual networks a little bit. Here's the original paper by Kaiming He et al.: https://arxiv.org/abs/1512.03385

In section 3.2, they describe "identity" shortcuts as y = F(x, W) + x, where W are the trainable parameters. You can see why it's called "identity": the value from the previous layer is added as-is, without any complex transformation. This does two things:

  • F now learns the residual y - x (discussed in 3.1), in short: it's easier to learn.
  • The network gets an extra connection to the previous layer, which improves gradient flow.

The backward flow through the identity mapping is trivial: the error signal is passed back unchanged, and no inverse matrices are involved (in fact, they aren't involved in backprop through any linear layer either).
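
To make "passed back unchanged" concrete, here's a minimal numpy sketch of one block with an identity shortcut. The block F below is just linear → ReLU → linear, which is my own stand-in for illustration, not the exact block from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=(1, d))
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# forward: y = F(x, W) + x
h = np.maximum(x @ W1, 0.0)          # ReLU
y = h @ W2 + x                       # identity shortcut: x is added as-is

# backward, given some upstream error dL/dy
grad_y = rng.normal(size=y.shape)
grad_h = grad_y @ W2.T
grad_x_through_F = (grad_h * (x @ W1 > 0)) @ W1.T
grad_x = grad_x_through_F + grad_y   # the shortcut just adds grad_y unchanged:
                                     # no inverse of any weight matrix appears
```

This is exactly how a generic addition node behaves in reverse-mode autodiff: the incoming gradient is routed to both summands unchanged.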

Now, the paper's authors go a bit further and consider a slightly more complicated version of F that changes the output dimensions (which is probably what you had in mind). They write it generally as y = F(x, W) + Ws * x, where Ws is a projection matrix. Note that, although it's written as a matrix multiplication, this operation can be as simple as padding x with extra zeros to make its shape larger. You can read a discussion of this operation in this question. But this changes very little in the backward pass: the error signal is simply clipped to the original shape of x.
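
Here's the same kind of sketch for the padded case (again with a stand-in block F; if a learned projection Ws is used instead of padding, the shortcut's backward contribution is just the error multiplied by Ws transposed, and still no inverse appears):

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_out = 4, 6
x = rng.normal(size=(1, d_in))
W = rng.normal(size=(d_in, d_out))

# forward: the block widens the signal, and the shortcut zero-pads x to match
Fx = np.maximum(x @ W, 0.0)                       # stand-in block output, width d_out
x_pad = np.pad(x, ((0, 0), (0, d_out - d_in)))    # "identity" shortcut via zero-padding
y = Fx + x_pad

# backward, given some upstream error dL/dy
grad_y = rng.normal(size=y.shape)
grad_x_from_shortcut = grad_y[:, :d_in]           # error clipped back to x's shape
grad_x_from_block = (grad_y * (x @ W > 0)) @ W.T  # error through the block F
grad_x = grad_x_from_block + grad_x_from_shortcut
```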

Maxim
  • I haven't created a new notation by using o instead of x since these mean different things. The input to the activation function (denoted as x) is the output of the previous layer (denoted o) multiplied by weights, there really should be no confusing them. You have created a convention around using x as a function for resolving matrix dimension mismatch; resolving a dimension mismatch is not part of my question and I believe is immaterial to the answer. – Jacob Statnekov Oct 24 '17 at 04:31
  • I believe I've just found the answer. Would you check figure 5 of https://arxiv.org/pdf/1603.05027v1.pdf and let me know if it makes sense to you? They write that ∂o/∂x is equal to 1. I believe the implication is that o can be treated as a simple variable, so it shouldn't be decomposed into operations on previous layer variables. – Jacob Statnekov Oct 24 '17 at 05:17
  • If you agree that that paper clears things up, then I think this question is solved. I'd like to mark your answer as correct in appreciation of your sticking with me on this question. Would you mind removing the parts of your answer that I've noted as a misunderstanding? If you reword to focus on where you write "that nothing special is done in backprop" (I assume you mean that it is indistinguishable from backprop over a non-residual layer, ie there are no additional terms in the layer-wise error calculation) and include the information in that paper, then I can give you credit. – Jacob Statnekov Oct 24 '17 at 05:22
  • Absolutely, I'll be glad to drop the `o`, once I understand what multiplication by weights you mean. What is `o` and `x=o*w` in this paper? – Maxim Oct 24 '17 at 07:43
  • The paper models the residual network as x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i). The sentence that this equation appears within provides term definitions. I provide two bullet points with the two different cases where o originates: either the previous layer, x_{L-1}, or an even shallower layer, x_l. My shallow case naturally follows from the previous-layer (x_{L-1}) case, so there's no need to address that. You can treat o as x_{L-1} and get a complete answer. – Jacob Statnekov Oct 24 '17 at 20:43
  • @JacobStatnekov apologies for the delay, I didn't give up on this. I'd like to note that this paper looks at a very specific res-net. They set `f` to identity, i.e. the non-linearity is only in the residual function `F`. In *this setting*, eq. (5) is true and h' = ∂h(x_L)/∂x_L = 1 (using notation from the paper). The gradient decomposition is very straightforward and the gradient in all res-connections is 1. Note that `x` in my answer is x_{L-1} and my `y` is x_L, and eq. (3) corresponds to my 1st eq. So yes, it makes sense to me. I've removed the `o` part from my answer. – Maxim Nov 14 '17 at 17:43
  • @JacobStatnekov apologies for delay, I didn't give up on this. I'd like to note that this paper looks at very specific res-net. They set `f` to identity, i.e. non-linearity is only in residual connection `F`. In *this setting*, eq. (5) is true and h'=∂h(x_L)/∂x_L=1 (using notation from the paper). Gradient decomposition is very straightforward and the gradient in all res-connections is 1. Note that `x` in my answer is x_{L-1} and my `y` is x_{L}, and eq. (3) corresponds to my 1st eq. So yes, it makes sense to me. I've removed the `o` part from my answer. – Maxim Nov 14 '17 at 17:43