To mimic a larger batch size, I want to accumulate gradients over N batches before each optimizer step for a PyTorch model, like:
def train(model, optimizer, dataloader, num_epochs, N):
    for epoch_num in range(1, num_epochs + 1):
        # start=1 so the first optimizer step happens after N batches, not after the very first batch
        for batch_num, data in enumerate(dataloader, start=1):
            ims = data.to('cuda:0')
            loss = model(ims)            # forward pass returns the loss directly
            loss.backward()              # gradients accumulate in each param.grad
            if batch_num % N == 0:
                optimizer.step()         # update weights once every N batches
                optimizer.zero_grad(set_to_none=True)
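For context, this is roughly how I call it; the toy model, optimizer, and data below are just placeholders (my real model also returns a scalar loss from forward):

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class ToyModel(nn.Module):
    # Stand-in for my real model: forward() returns a scalar loss directly.
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(32, 1)

    def forward(self, x):
        return self.net(x).pow(2).mean()

model = ToyModel().to('cuda:0')                               # train() assumes a CUDA device
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
dataloader = DataLoader(torch.randn(256, 32), batch_size=16)  # yields plain tensors per batch

train(model, optimizer, dataloader, num_epochs=2, N=4)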
For this approach, do I need to add the flag retain_graph=True, i.e. loss.backward(retain_graph=True)?
With this approach, are the gradients from each backward() call simply summed per parameter?
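To make the second question concrete, here is a toy check I would run (the values are arbitrary): if gradients are simply summed into p.grad, the second print should show double the first.

import torch

p = torch.tensor([1.0, 2.0], requires_grad=True)

loss = (3 * p).sum()      # d(loss)/dp = [3, 3]
loss.backward()
print(p.grad)             # tensor([3., 3.])

loss = (3 * p).sum()      # a fresh graph over the same parameter
loss.backward()           # no retain_graph passed here
print(p.grad)             # if gradients accumulate, this should be tensor([6., 6.])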