Upon running the code snippet below (PyTorch 1.7.1; Python 3.8):
```python
import numpy as np
import torch


def batch_matrix(vector_pairs, factor=2):
    # Split the (seq_len, seq_len, 2 * dim) pairing matrix into factor ** 2
    # tiles; the last tile along each axis absorbs any remainder.
    baselen = len(vector_pairs[0]) // factor
    split_batch = []
    for j in range(factor):
        for i in range(factor):
            start_j = j * baselen
            end_j = (j + 1) * baselen if j != factor - 1 else None
            start_i = i * baselen
            end_i = (i + 1) * baselen if i != factor - 1 else None
            mini_pairs = vector_pairs[start_j:end_j, start_i:end_i, :]
            split_batch.append(mini_pairs)
    return split_batch


def concat_matrix(vectors_):
    # Build all pairwise concatenations: entry (a, b) of the result is
    # torch.cat((vectors[a], vectors[b])).
    vectors = vectors_.clone()
    seq_len, dim_vec = vectors.shape
    project_x = vectors.repeat((1, 1, seq_len)).reshape(seq_len, seq_len, dim_vec)
    project_y = project_x.permute(1, 0, 2)
    matrix = torch.cat((project_x, project_y), dim=-1)
    matrix_ = matrix.clone()
    return matrix_


if __name__ == "__main__":
    vector_list = []
    for i in range(10):
        vector_list.append(torch.randn((5,), requires_grad=True))
    vectors = torch.stack(vector_list, dim=0)
    pmatrix = concat_matrix(vectors)
    factor = np.ceil(vectors.shape[0] / 6).astype(int)
    batched_feats = batch_matrix(pmatrix, factor=factor)
    for i in batched_feats:
        i = i + 5
        print(i.shape)
        summed = torch.sum(i)
        summed.backward()
```
I get the following output and error:

```
torch.Size([5, 5, 10])
torch.Size([5, 5, 10])
Traceback (most recent call last):
  File "/home/user/PycharmProjects/project/run.py", line 43, in <module>
    summed.backward()
  File "/home/user/anaconda3/envs/diff/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/user/anaconda3/envs/diff/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
```
I have read all the existing posts on this issue and could not resolve it myself. Passing retain_graph=True to backward() does fix the error in the provided snippet; for reference, the change applied to the toy example's main loop is shown below.
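This is just the workaround applied to the snippet above, not my real training loop; only the backward() call changes:

```python
for i in batched_feats:
    i = i + 5
    print(i.shape)
    summed = torch.sum(i)
    # Keep the graph's buffers alive after this backward pass, since all
    # four chunks are slices of pmatrix and share the same upstream graph.
    summed.backward(retain_graph=True)
```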
However, the snippet is only an oversimplified version of a large network, where retain_graph=True changes the error to the following:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3000, 512]], which is output 0 of TBackward, is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
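As far as I understand, this second error means that a tensor saved for the backward pass was later mutated in place. A toy pattern of my own (unrelated to the actual network) that raises the same class of error:

```python
import torch

w = torch.randn(4, requires_grad=True)
x = w + 0           # non-leaf tensor, so in-place ops on it are permitted
y = x * x           # MulBackward0 saves x for the backward pass
x.add_(1)           # in-place update bumps x's version counter
y.sum().backward()  # RuntimeError: one of the variables needed for gradient
                    # computation has been modified by an inplace operation
```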
I tried setting torch.autograd.set_detect_anomaly(True) to determine the point of failure, but everything I tried failed and the error persisted.
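For completeness, this is how I enabled anomaly detection; model and inputs here are placeholders for the real network:

```python
import torch

# Global switch: records the forward-pass stack trace for each op, so the
# backward error can point at the forward op that produced the bad tensor.
torch.autograd.set_detect_anomaly(True)

# Equivalent scoped form around the failing region:
with torch.autograd.detect_anomaly():
    output = model(inputs)  # placeholder for the real forward pass
    loss = output.sum()     # placeholder loss
    loss.backward(retain_graph=True)
```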
I suspect that if I can understand the cause of the error in this simplified situation, it will help me resolve it in the actual codebase. Therefore, I want to understand why backward() works fine for the first two tensors in batched_feats but fails for the third one. I would really appreciate it if someone could help me see where an intermediate result that has already been freed is being reused.
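In case it is useful, this is how I have been looking at the chunks' graphs: a quick inspection sketch (added for this question, not part of the original script) that walks grad_fn.next_functions upward and shows that all four chunks lead back into the same upstream nodes:

```python
def graph_chain(t):
    # Follow the first parent at each step; id() identifies shared nodes.
    fn, chain = t.grad_fn, []
    while fn is not None:
        chain.append((type(fn).__name__, id(fn)))
        fn = fn.next_functions[0][0] if fn.next_functions else None
    return chain

for idx, feats in enumerate(batched_feats):
    # The SliceBackward nodes differ per chunk, but every chain converges
    # on the same CloneBackward/CatBackward nodes (identical ids).
    print(idx, graph_chain(feats))
```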
Thanks a lot!