I was trying to create a Transformer Encoder Block, and in its forward() function I used the "+=" operator:
    def forward(self, x):
        x += self.msa_block(x)
        x += self.mlp_block(x)
        return x
Then I got this error message:
    RuntimeError: one of the variables needed for gradient computation has been
    modified by an inplace operation: [torch.FloatTensor [32, 197, 768]], which
    is output 0 of AddBackward0, is at version 24; expected version 23 instead.
    Hint: enable anomaly detection to find the operation that failed to compute
    its gradient, with torch.autograd.set_detect_anomaly(True).
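For reference, the anomaly detection that the hint mentions can be enabled like this (a minimal sketch with a placeholder nn.Linear model, not my actual block; enabling it makes the backward error carry a second traceback pointing at the offending forward operation):

    import torch
    import torch.nn as nn

    # Global switch: backward errors will now include a second traceback
    # pointing at the forward op that produced the problematic tensor.
    torch.autograd.set_detect_anomaly(True)

    model = nn.Linear(8, 8)     # placeholder module, not my encoder block
    inputs = torch.randn(4, 8)

    # The context-manager form limits the (slow) checks to a single region:
    with torch.autograd.detect_anomaly():
        model(inputs).sum().backward()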
After some searching and trial and error, I found out that the forward() function should look like this:
    def forward(self, x):
        x = self.msa_block(x) + x
        x = self.mlp_block(x) + x
        return x
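For completeness, here is a runnable stand-in for my block with the fixed forward(). The msa_block/mlp_block internals below are just a typical pre-norm ViT-style layout I wrote for this post (the [32, 197, 768] shape in the error comes from a ViT), not necessarily my exact layers:

    import torch
    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, dim=768, heads=12, mlp_dim=3072):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
            )

        def msa_block(self, x):
            x = self.norm1(x)
            return self.attn(x, x, x, need_weights=False)[0]

        def mlp_block(self, x):
            return self.mlp(self.norm2(x))

        def forward(self, x):
            # Out-of-place adds: each residual produces a NEW tensor,
            # so tensors saved for backward are never mutated.
            x = self.msa_block(x) + x
            x = self.mlp_block(x) + x
            return x

    block = EncoderBlock()
    x = torch.randn(32, 197, 768)
    block(x).sum().backward()   # runs without the RuntimeError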
As I understand it, the problem occurs during gradient computation (backward()). My question is: what exactly caused the gradient computation to fail?
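To illustrate what I mean, the same error can be reproduced without any model at all; the sketch below uses torch.sigmoid simply because its backward pass needs the op's own output:

    import torch

    x = torch.randn(3, requires_grad=True)
    y = torch.sigmoid(x)   # autograd saves y: sigmoid's grad is y * (1 - y)
    y += 1                 # in-place add bumps y's version counter
    y.sum().backward()     # RuntimeError: ... modified by an inplace operation

The "is at version 24; expected version 23" part of my error seems to refer to the same kind of version counter that this snippet trips over.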