I'm trying to train a model in Pytorch, and I'd like to have a batch size of 8, but due to memory limitations, I can only have a batch size of at most 4. I've looked all around and read a lot about accumulating gradients, and it seems like the solution to my problem.
However, I seem to have trouble implementing it. Every time I run the code I get RuntimeError: Trying to backward through the graph a second time
. I don't understand why since my code looks like all these other examples I've seen (unless I'm just missing something major):
- https://stackoverflow.com/a/62076913/1227353
- https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
- https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/20
One caveat is that the labels for my images are all different size, so I can't send the output batch and the label batch into the loss function; I have to iterate over them together. This is what an epoch looks like (it's been pared down for the sake of brevity):
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
outputs_batch = model(inputs_batch)
# have to do this because labels can't be stacked into a tensor
for output, label in zip(outputs_batch, labels_batch):
output_scaled = interpolate(...) # make output match label size
loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
loss.backward()
if batch_idx % 2 == 1:
optimizer.step()
optimizer.zero_grad()
Is there something I'm missing? If I do the following I also get an error:
# labels_batch contains labels of different sizes
for batch_idx, (inputs_batch, labels_batch) in enumerate(dataloader):
outputs_batch = model(inputs_batch)
# CHANGE: we're gonna accumulate losses manually
batch_loss = 0
# have to do this because labels can't be stacked into a tensor
for output, label in zip(outputs_batch, labels_batch):
output_scaled = interpolate(...) # make output match label size
loss = train_criterion(output_scaled, label) / (BATCH_SIZE * 2)
batch_loss += loss # CHANGE: accumulate!
# CHANGE: do backprop outside for loop
batch_loss.backward()
if batch_idx % 2 == 1:
optimizer.step()
optimizer.zero_grad()
The error I get in this case is RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
. This happens when the next epoch starts though... (INCORRECT, SEE EDIT BELOW)
How can I train my model with gradient accumulation? Or am I doomed to train with a batch size of 4 or less?
Oh and as a side question, does the location of where I put loss.backward()
affect what I need to normalize the loss by? Or is it always normalized by BATCH_SIZE * 2
?
EDIT:
The second code segment was getting an error due to the fact that I was doing torch.set_grad_enabled(phase == 'train')
but I had forgotten to wrap the call to batch_loss.backward()
with an if phase == 'train'
... my bad
So now the second segment of code seems to work and do gradient accumulation, but why doesn't the first bit of code work? It feel equivalent to setting BATCH_SIZE
as 1. Furthermore, I'm creating a new loss
object each time, so shouldn't the calls to backward()
operate on different graphs entirely?