I'm fine-tuning a Longformer on a binary document classification task using the Hugging Face Trainer class, and I'm monitoring the metrics of some checkpoints with TensorBoard.
Even though the F1 score and accuracy are quite high, I'm puzzled by the fluctuations of the training loss.
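For context, my setup looks roughly like this (the checkpoint name, epoch count and logging/eval intervals are illustrative, and the tokenized datasets and `compute_metrics` function are prepared elsewhere):

```python
from transformers import (
    LongformerForSequenceClassification,
    LongformerTokenizerFast,
    Trainer,
    TrainingArguments,
)

model_name = "allenai/longformer-base-4096"
tokenizer = LongformerTokenizerFast.from_pretrained(model_name)  # used to tokenize the documents (tokenization code omitted)
model = LongformerForSequenceClassification.from_pretrained(model_name, num_labels=2)

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,   # 1 per GPU -> effective batch size of 8 on 8x K80
    learning_rate=1e-5,              # also tried 1e-4 and 1e-6
    num_train_epochs=3,              # illustrative
    logging_dir="./logs",            # TensorBoard reads the logged metrics from here
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    report_to="tensorboard",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,      # 57K tokenized documents
    eval_dataset=dev_dataset,         # 12K tokenized documents
    compute_metrics=compute_metrics,  # returns accuracy and F1 (sketched further down)
)
trainer.train()
```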
From what I've read online, possible reasons for this are:
- a learning rate that is too high, but I tried three values (1e-4, 1e-5 and 1e-6) and all of them had the same effect
- a batch size that is too small. I'm using a SageMaker notebook on a p2.8xlarge instance, which has 8 K80 GPUs. The batch size per GPU I can use without hitting a CUDA out-of-memory error is 1, so the total batch size is 8. My intuition is that a batch size of 8 is too small for a dataset of 57K examples (about 7K steps per epoch; see the sanity check below the list), but unfortunately it's the highest value I can use.
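Just to make the arithmetic explicit (numbers taken from the setup above):

```python
per_gpu_batch_size = 1
num_gpus = 8
train_examples = 57_000

effective_batch_size = per_gpu_batch_size * num_gpus       # 8
steps_per_epoch = train_examples // effective_batch_size   # 7125, i.e. ~7K steps
print(effective_batch_size, steps_per_epoch)
```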
Below I have plotted the trends of F1, accuracy, loss and smoothed loss. The grey line is the run with a learning rate of 1e-6, while the pink one is 1e-5.

To summarize, here is all the relevant info about my training:
- batch size: 1 per GPU x 8 GPUs = 8
- learning rate: 1e-4, 1e-5 and 1e-6 (all tested, with no improvement in the loss)
- model: Longformer
- dataset:
  - training set: 57K examples
  - dev set: 12K examples
  - test set: 12K examples
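For reference, the F1 and accuracy reported above come from a `compute_metrics` function along these lines (a minimal sketch using scikit-learn; the exact implementation may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) tuple provided by the Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),  # binary F1 on the positive class
    }
```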
What could be the reason? Should this be considered a problem despite the quite good F1 and accuracy results?