
Upon running the code snippet (PyTorch 1.7.1; Python 3.8),

import numpy as np
import torch

def batch_matrix(vector_pairs, factor=2):
    baselen = len(vector_pairs[0]) // factor
    split_batch = []

    for j in range(factor):
        for i in range(factor):
            start_j = j * baselen
            end_j = (j+1) * baselen if j != factor - 1 else None
            start_i = i * baselen
            end_i = (i+1) * baselen if i != factor - 1 else None

            mini_pairs = vector_pairs[start_j:end_j, start_i:end_i, :]
            split_batch.append(mini_pairs)
    return split_batch

def concat_matrix(vectors_):
    vectors = vectors_.clone()
    seq_len, dim_vec = vectors.shape
    project_x = vectors.repeat((1, 1, seq_len)).reshape(seq_len, seq_len, dim_vec)
    project_y = project_x.permute(1, 0, 2)
    matrix = torch.cat((project_x, project_y), dim=-1)
    matrix_ = matrix.clone()

    return matrix_

if __name__ == "__main__":
    vector_list = []
    for i in range(10):
        vector_list.append(torch.randn((5,), requires_grad=True))
    vectors = torch.stack(vector_list, dim=0)
    pmatrix = concat_matrix(vectors)

    factor = np.ceil(vectors.shape[0]/6).astype(int)
    batched_feats = batch_matrix(pmatrix, factor=factor)

    for i in batched_feats:
        i = i + 5
        print(i.shape)
        summed = torch.sum(i)
        summed.backward()

I get the output and error as below:

torch.Size([5, 5, 10])
torch.Size([5, 5, 10])
Traceback (most recent call last):
  File "/home/user/PycharmProjects/project/run.py", line 43, in <module>
    summed.backward()
  File "/home/user/anaconda3/envs/diff/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/envs/diff/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.

I have read all the existing posts on this issue and could not resolve it myself. Passing retain_graph=True to backward() fixes the issue in the provided snippet. However, the snippet is only a heavily simplified version of a large network, where retain_graph=True changes the error to the following:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3000, 512]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

I tried enabling torch.autograd.set_detect_anomaly(True) to locate the point of failure, but nothing I tried worked and the error persisted.

I suspect that if I can understand the cause of the error in this snippet, it will help me resolve the error in the actual codebase.

Therefore, I want to understand why backward() works fine for the first two tensors in batched_feats but fails for the third one. I would really appreciate it if someone could help me see where an intermediate result that has already been freed is being reused.

Thanks a lot!

  • @bobcat Thanks for the link. However, I have seen that question, and even though it's the same error, the context is different. That question has a GRU whose states were not detached and therefore required backward twice. That doesn't apply to my question. I fail to understand how backward ends up being called twice on a freed tensor in my case. – TweetysOldFriend Sep 26 '21 at 23:56
  • You are not using a GRU, but the principle's the same. – MWB Sep 27 '21 at 00:22

1 Answer


After backpropagation, the leaf nodes' gradients are stored in their Tensor.grad attributes. The intermediate results to which the error refers (the values that operations in the graph saved during the forward pass so that they could compute gradients, along with the gradients of non-leaf nodes) are freed by default, as PyTorch assumes you won't need them again. In your example, the leaf nodes are the tensors in vector_list created from torch.randn().

Calling backward() multiple times consecutively accumulates gradients in the leaf nodes via summation by default (this is useful for recurrent neural networks). This is problematic when a call to backward() passes through some of the same intermediate results as a previous call, because the earlier call has already freed them while the leaf nodes' gradients remain. This is the problem you're facing: all of your tensor slices are cut from the same pmatrix, so every call to backward() has to traverse the graph that produced pmatrix from vector_list, and the saved results in that shared part of the graph are freed once a backward pass has gone through them.
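A minimal sketch of the same failure, independent of the code in the question, may make the shared graph easier to see. The tensor names here are illustrative, and exp() is chosen only because it is an operation that saves its output for the backward pass:

```python
import torch

x = torch.randn(4, 3, requires_grad=True)   # leaf tensor
shared = x.exp()                            # exp() saves its output for use in backward
first, second = shared[:2], shared[2:]      # disjoint slices of the same upstream graph

first.sum().backward()                      # succeeds, then frees the saved tensors

try:
    second.sum().backward()                 # has to pass through the exp() node again
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print(type(err).__name__)
```

Although the two slices share no elements, both backward passes go through the single exp() node, so the second call finds its saved output already freed.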

If you wish to accumulate gradients in the leaf nodes via summation, simply call backward like so: summed.backward(retain_graph=True).
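As a rough illustration of that accumulation (a sketch with illustrative tensor names, not the code from the question), the graph can be retained for every slice except the last, after which the leaf's .grad holds the sum of the per-slice gradients:

```python
import torch

x = torch.randn(4, 3, requires_grad=True)   # leaf tensor
shared = x.exp()                            # shared upstream computation
slices = [shared[:2], shared[2:]]           # disjoint slices

for k, s in enumerate(slices):
    # Keep the graph alive until the final slice has been processed.
    keep = k < len(slices) - 1
    s.sum().backward(retain_graph=keep)

# d/dx sum(exp(x)) = exp(x); each slice contributed the rows it covers.
grads_match = torch.allclose(x.grad, x.detach().exp())
print(grads_match)
```

Letting the last call free the graph avoids holding the saved tensors in memory longer than needed.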

However, if you wish to compute gradients with respect to your batches independently (rather than w.r.t. the leaf nodes in vector_list), then you can just detach your batches at the beginning of each iteration. This will prevent gradients from propagating through them all the way to their common leaf nodes in vector_list (i.e. they become leaf nodes themselves in their own graphs). Detaching a tensor disables gradients for it, so you'll have to re-enable them manually:

for i in batched_feats:
    i = i.detach()
    i.requires_grad = True
    j = i + 5
    print(j.shape)
    summed = torch.sum(j)
    summed.backward()
    print(i.grad) # Prints the gradients stored in i

This is how some data loaders work; they load the data from disk, convert them to tensors, perform augmentation / other preprocessing, and then detach them so that they can serve as leaf nodes in a fresh computational graph. If the application developer wants to compute gradients w.r.t. the data tensors, they do not have to save intermediate results since the data tensors have been detached and thus serve as leaf nodes.

Alexander Guyer
  • Thanks for your answer. I was wondering if it is possible to use different disjoint slices of a tensor as input in different batches. The problem I am facing is that since the gradients are associated with the original tensor only (and not the slices), after the first slice is used in a batch, the intermediate results are freed and I get this error of doing backward twice. – TweetysOldFriend Sep 27 '21 at 18:06
  • My main network is as follows. Inputs pass through a feature extractor model. The resulting features can vary in length along the time dimension because the input sequences have variable length. If the length of the features (in the time dimension) is below a threshold, I pass them to an LSTM model as one batch; otherwise I slice the feature tensor into 4 disjoint parts and send these as 4 batches to the LSTM. The LSTM and feature extractor need to be trained jointly, so I want the gradients to flow through the slices. How can I avoid the double-backward error and still use feature slices in different batches? – TweetysOldFriend Sep 27 '21 at 18:14
  • `retain_graph=True` doesn't solve the issue in the main network. I get a `RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation ...` error when using it. I suspect that it's again due to the slices being computed in the loop as shown above and assigned to the same variable `mini_pairs`. I tried cloning the slices but that didn't help. I've been stuck with this problem for a long time and have tried everything in my capacity. You were the only kind stranger who took the time to answer my question. I would be grateful if you could guide me further. Thanks a lot! – TweetysOldFriend Sep 27 '21 at 18:20
  • I don't get any errors running just the snippet provided with `retain_graph=True`, so the runtime error must be caused somewhere else (maybe in the feature extractor?) You might start by just replacing inplace operations with alternatives and see if the error goes away. I also don't think any of the cloning going on here is doing anything helpful. Lastly, if you're running with `retain_graph=True`, be extra careful to zero out gradients when you're done with them (e.g. `optimizer.zero_grad()`). – Alexander Guyer Sep 28 '21 at 20:18
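One possible way to train through disjoint slices without retain_graph, sketched under assumptions: the two Linear layers below are hypothetical stand-ins for the feature extractor and the LSTM, and the loss is arbitrary. The idea is to sum the per-slice losses and call backward() once, zeroing gradients at the start of each step:

```python
import torch

torch.manual_seed(0)

# Hypothetical stand-ins for the feature extractor and the downstream model.
extractor = torch.nn.Linear(5, 8)
head = torch.nn.Linear(8, 1)
params = list(extractor.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

inputs = torch.randn(12, 5)

optimizer.zero_grad()                        # clear gradients from any earlier step
features = torch.tanh(extractor(inputs))     # shared graph, built once
chunks = torch.chunk(features, 4, dim=0)     # 4 disjoint slices along dim 0

# Sum the per-slice losses so the shared graph is traversed by one backward pass.
total_loss = sum(head(c).pow(2).mean() for c in chunks)
total_loss.backward()
optimizer.step()
```

Because the optimizer step happens only after the single backward pass, no parameter is modified in place while the graph is still needed.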