Why is training loss oscillating up and down?

Question

I am using the TF2 research object detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss is jumping up and down - for example from 0.5 to 2.0 a few steps later, and then back to 0.75:

[Images: TensorBoard charts of the total loss and learning rate during training]

So all in all, this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during training. It goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the second half of the training).

Also, when I smooth the curves in TensorBoard, as in the second image above, one can see the total loss going down, so the direction is correct, even though it's still at quite a high value. I would be interested in why I can't achieve better results with my training set, but I guess that is another question. First, I would really like to know why the total loss goes up and down so much throughout the training. Any ideas?

PS: The pipeline.config file for my training can be found here.

It's three different kinds of plants; we have about 20k training photos. — Matthias, Mar 05 '21 at 12:20

D Hudson · Accepted Answer · 2021-03-08T09:03:14.477

In your config it states that your batch size is 2. This is tiny and will cause a very volatile loss.
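To illustrate why, here is a minimal simulation using only the Python standard library. It is a sketch under made-up assumptions, not a model of your actual training: each example's loss is drawn from a Gaussian with mean 1.0 and standard deviation 0.8 (both numbers are hypothetical), and the logged step loss is the mean over the batch. The helper name `batch_loss_std` is my own.

```python
import random
import statistics


def batch_loss_std(batch_size, num_steps=2000, seed=0):
    """Standard deviation of the per-step mean loss for a given batch size."""
    rng = random.Random(seed)
    step_losses = []
    for _ in range(num_steps):
        # Hypothetical per-example losses: mean 1.0, noise std 0.8.
        batch = [rng.gauss(1.0, 0.8) for _ in range(batch_size)]
        # The logged loss for a step is the mean over the batch.
        step_losses.append(sum(batch) / batch_size)
    return statistics.stdev(step_losses)


small = batch_loss_std(batch_size=2)
large = batch_loss_std(batch_size=64)
print(f"batch size 2:  std of step loss = {small:.3f}")
print(f"batch size 64: std of step loss = {large:.3f}")
```

Under these assumptions the spread of the step loss shrinks roughly with the square root of the batch size, so a batch of 2 gives a much noisier curve than a batch of 64 even when the underlying model is identical.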

Try increasing your batch size substantially; try a value of 256 or 512. If you are constrained by memory, try increasing it via gradient accumulation.

Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.

Typically, a training loop would look like this (I'm using PyTorch-like syntax for illustrative purposes):

for model_inputs, truths in iter_batches():
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

With gradient accumulation, you'll put several batches through and then update the model. This simulates a larger batch size without requiring the memory to actually put a large batch size through all at once:

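accumulations = 10

for i, (model_inputs, truths) in enumerate(iter_batches()):
    predictions = model(model_inputs)
    # Scale the loss so the accumulated gradient is an average, not a sum
    # (assumes get_loss returns the mean loss over the mini-batch).
    loss = get_loss(predictions, truths) / accumulations
    loss.backward()  # gradients add up across calls until zero_grad()
    # Update the parameters once every `accumulations` mini-batches.
    if (i + 1) % accumulations == 0:
        optimizer.step()
        optimizer.zero_grad()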
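As a sanity check of the idea, the following self-contained sketch (plain Python, no framework) compares the gradient of one full batch with the gradient accumulated over smaller mini-batches. The one-parameter model `y ≈ w * x`, the data, and the helper `mse_grad` are all made up for illustration; the point is only that averaging the mini-batch gradients reproduces the large-batch gradient when the mini-batches have equal size.

```python
def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y ≈ w * x with respect to w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

# Gradient computed in one pass over the full batch of 8 examples.
full_grad = mse_grad(w, xs, ys)

# The same gradient built up from 4 mini-batches of 2 examples each.
micro_batch_size = 2
accumulations = len(xs) // micro_batch_size
accumulated = 0.0
for k in range(accumulations):
    lo = k * micro_batch_size
    hi = lo + micro_batch_size
    # Dividing by `accumulations` turns the running sum into an average.
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulations

print(f"full-batch gradient:  {full_grad:.6f}")
print(f"accumulated gradient: {accumulated:.6f}")
```

Note that this equivalence holds for the gradients only. Layers that compute statistics per batch, such as batch normalization, still see the small mini-batches, so accumulation is not identical to a genuinely larger batch in every respect.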

I am indeed constrained by memory. Can you please elaborate on the "gradient accumulation"? — Matthias, Mar 08 '21 at 08:47
I added some information on gradient accumulation and reference reading. — D Hudson, Mar 08 '21 at 09:04