3

I am training a BERT model on a relatively small dataset and cannot afford to lose any labelled samples, so they must all be used for training. Due to GPU memory constraints, I am using gradient accumulation to train on larger effective batches (e.g. 32). According to the PyTorch documentation, gradient accumulation with AMP is implemented as follows:

from torch.cuda.amp import GradScaler, autocast

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

However, with e.g. 110 training samples, batch size 8 and accumulation step 4 (i.e. effective batch size 32), this loop only takes optimizer steps for the first 96 samples (3 x 32): the gradients from the remaining 14 samples are accumulated but never applied, so those samples are wasted. To avoid this, I'd like to modify the code as follows (note the change to the final if statement):

scaler = GradScaler()

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / iters_to_accumulate

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == len(data):
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)

            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

Is this correct and really that simple, or will this have any side effects? It seems very simple to me, but I've never seen it done before. Any help appreciated!
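
For concreteness, here is a quick, training-free sanity check (plain Python over batch indices, using the numbers from my example above) showing which mini-batches would trigger an optimizer step under each condition:

import math

n_samples, batch_size, iters_to_accumulate = 110, 8, 4
n_batches = math.ceil(n_samples / batch_size)  # 14 mini-batches with drop_last=False

# mini-batch indices (0-based) at which scaler.step(optimizer) would run
original = [i for i in range(n_batches) if (i + 1) % iters_to_accumulate == 0]
modified = [i for i in range(n_batches)
            if (i + 1) % iters_to_accumulate == 0 or (i + 1) == n_batches]

print(original)  # [3, 7, 11]     -> batches 12 and 13 never contribute to an update
print(modified)  # [3, 7, 11, 13] -> the final, smaller group also gets an update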

andrea
  • Hi Andrea, as far as I know, if in your dataloader you use drop_last=False, it will create a smaller batch at the end of your training (with the remaining samples). Therefore, I think just increasing iter_to_accumulate by 1 before starting this for loop should fix it. – Lucas Ramos Jan 27 '21 at 00:47
  • See my reply to Shai. I am not sure how this answers my question – andrea Jan 28 '21 at 11:58

2 Answers

4

As Lucas Ramos already mentioned, when using a DataLoader whose underlying dataset size is not divisible by the batch size, the default behavior is to produce a smaller last batch:

drop_last (bool, optional) – set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller. (default: False)

Your plan is basically gradient accumulation combined with drop_last=False, that is, the last effective batch ends up smaller than all the others.
Therefore, in principle there is nothing wrong with training with varying batch sizes.

However, there is something you need to fix in your code:
The loss is averaged over each mini-batch, so when you process mini-batches in the usual way you do not need to worry about it. When accumulating gradients, however, you perform that averaging explicitly by dividing the loss by iters_to_accumulate:

loss = loss / iters_to_accumulate

In the last, smaller accumulation group you need to change iters_to_accumulate to reflect the actual number of accumulated mini-batches (with your numbers, the last group contains only 2 mini-batches, so its losses should be divided by 2, not 4).

I propose this revised code, which breaks the training loop in two: an outer loop over effective batches, and an inner one that accumulates gradients over the mini-batches of each group. Note how using iter over the DataLoader makes it easy to split the loop this way:

scaler = GradScaler()

for epoch in epochs: 
    bi = 0  # index batches
    # outer loop over minibatches
    data_iter = iter(data)
    while bi < len(data):
        # determine the range for this batch
        nbi = min(len(data), bi + iters_to_accumulate)
        # inner loop over the items of the mini batch - accumulating gradients
        for i in range(bi, nbi):
            input, target = next(data_iter)  # next(...) rather than .next(), which Python 3 iterators don't have
            with autocast():
                output = model(input)
                loss = loss_fn(output, target)
                loss = loss / (nbi - bi)  # divide by the true batch size

            # Accumulates scaled gradients.
            scaler.scale(loss).backward()
        # done with the inner loop - gradients were accumulated, we can take an optimization step.
        
        # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
        bi = nbi 
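
If you prefer to keep the original single-loop structure from the question, the same correction can be sketched like this (a sketch only, assuming as in the question that data is a DataLoader whose length is the number of mini-batches per epoch):

n_batches = len(data)

for epoch in epochs:
    for i, (input, target) in enumerate(data):
        # number of mini-batches in the accumulation group this one belongs to
        group_start = (i // iters_to_accumulate) * iters_to_accumulate
        group_size = min(iters_to_accumulate, n_batches - group_start)

        with autocast():
            output = model(input)
            loss = loss_fn(output, target)
            loss = loss / group_size  # divide by the true number of accumulated mini-batches

        # Accumulates scaled gradients.
        scaler.scale(loss).backward()

        if (i + 1) % iters_to_accumulate == 0 or (i + 1) == n_batches:
            # may unscale_ here if desired (e.g., to allow clipping unscaled gradients)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
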
Shai
  • Hi @Shai I am not sure how this answers my question. I am already using `drop_last = False`. However, I don't see how this relates to my issue, which is that no weights update step is taken for the last set of batches when using gradient accumulation. Ensuring that all samples are included in a batch and none is dropped doesn't guarantee that a weight update will be taken for that batch during gradient accumulation. Let me know if this is clear, or else I'll edit my question to clarify this point, thanks – andrea Jan 28 '21 at 11:57
  • Thanks a lot for the update @Shai. There is still one thing I don't understand if you don't mind. Firstly, in my example (I should have made it clearer) the variable `data` is a `DataLoader` instance, as opposed to a `torch.Tensor`, so it is not subscriptable, i.e. I can't do `data[bi:nbi]`, however that is not needed because the `__iter__` method on the `DataLoader` can easily deal with the last smaller mini batch. I do like your nice solution of changing `iter_to_accumulate`. However, I don't understand why you removed the `if` statement and are taking a gradient update at every forward pass – andrea Jan 28 '21 at 13:53
  • Apologies, I see why the `if` statement can be avoided, given you're using two loops. I guess the only thing I'll have to somewhat fix is the issue with `data[bi:nbi]` which doesn't work. Perhaps I can get around it by calling the `list` function, to make the `DataLoader` subscriptable, i.e. `for i, (input, target) in enumerate(list(data)[bi:nbi]):` What do you think of this solution? – andrea Jan 28 '21 at 14:00
  • The remaining issue being that the `DataLoader` instance, i.e. `data`, returns the samples in random order, so if I have a gradient accumulation step of say 4, and the for loop is doing `data[0:4]` the first time, then `data[4:8]`, `data[8:12]` etc. there is no guarantee that all training samples will be covered once, because every time the samples will be fetched in a random order, so some may appear more than once and some may never appear – andrea Jan 28 '21 at 14:22
  • @andrea "random order" of `DataLoader` means it _shuffles_ the examples, not drawing at random. – Shai Jan 31 '21 at 06:02
  • @andrea you can use `iter` over `data` to get the equivalent of `data[bi:nbi]`. See my updated answer. – Shai Jan 31 '21 at 06:10
  • This code does not correctly handle the last mini batch being potentially smaller in size due to `drop_last=False`, it only corrects for the last number of accumulations possibly being less, which is also necessary, but something different. Consider 1000 samples, batch size 128, accumulating 4 batches at a time. This results in 8 batches, which are accumulated into 2 sets of 4. `nbi - bi` is always 4, but the 8th batch only has 1000 % 128 = 104 samples instead of 128, so just dividing the loss by 4 is not correct. – pallgeuer Nov 10 '22 at 12:14
0

I was pretty sure I'd seen this done before. Check out this code from PyTorch Lightning (the functions _accumulated_batches_reached, _num_training_batches_reached and should_accumulate).
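
Roughly, the check those functions implement boils down to the same condition as in the question. This is a paraphrased sketch, not Lightning's actual code; the free-function signatures are illustrative only:

def _accumulated_batches_reached(batch_idx, accumulate_grad_batches):
    # a full group of batches has been accumulated
    return (batch_idx + 1) % accumulate_grad_batches == 0

def _num_training_batches_reached(batch_idx, num_training_batches):
    # this is the last batch of the epoch
    return (batch_idx + 1) == num_training_batches

def should_accumulate(batch_idx, accumulate_grad_batches, num_training_batches):
    # keep accumulating unless a full group was reached or the epoch is ending
    return not (_accumulated_batches_reached(batch_idx, accumulate_grad_batches)
                or _num_training_batches_reached(batch_idx, num_training_batches))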

lbd