
I tried to implement an early stopping function to prevent my neural network model from overfitting. I'm pretty sure the logic is fine, but for some reason it doesn't work. I want the early stopping function to return True when the validation loss is greater than the training loss over some number of epochs. But it returns False all the time, even when the validation loss becomes a lot greater than the training loss. Could you see where the problem is, please?

The early stopping function:

def early_stopping(train_loss, validation_loss, min_delta, tolerance):

    counter = 0
    if (validation_loss - train_loss) > min_delta:
        counter +=1
        if counter >= tolerance:
          return True

Calling the function during training:

for i in range(epochs):
    
    print(f"Epoch {i+1}")
    epoch_train_loss, pred = train_one_epoch(model, train_dataloader, loss_func, optimiser, device)
    train_loss.append(epoch_train_loss)

    # validation 

    with torch.no_grad(): 
       epoch_validate_loss = validate_one_epoch(model, validate_dataloader, loss_func, device)
       validation_loss.append(epoch_validate_loss)
    
    # early stopping
    if early_stopping(epoch_train_loss, epoch_validate_loss, min_delta=10, tolerance = 20):
      print("We are at epoch:", i)
      break

EDIT: The train and validation loss: [plots of the train and validation loss curves]

EDIT2:

def train_validate(model, train_dataloader, validate_dataloader, loss_func, optimiser, device, epochs):
    preds = []
    train_loss =  []
    validation_loss = []
    min_delta = 5
    

    for e in range(epochs):
        
        print(f"Epoch {e+1}")
        epoch_train_loss, pred = train_one_epoch(model, train_dataloader, loss_func, optimiser, device)
        train_loss.append(epoch_train_loss)

        # validation 
        with torch.no_grad(): 
           epoch_validate_loss = validate_one_epoch(model, validate_dataloader, loss_func, device)
           validation_loss.append(epoch_validate_loss)
        
        # early stopping
        early_stopping = EarlyStopping(tolerance=2, min_delta=5)
        early_stopping(epoch_train_loss, epoch_validate_loss)
        if early_stopping.early_stop:
            print("We are at epoch:", e)
            break

    return train_loss, validation_loss
Totoro

3 Answers


Although @KarelZe's response solves your problem sufficiently and elegantly, I want to provide an alternative early stopping criterion that is arguably better.

Your early stopping criterion is based on how much (and for how long) the validation loss diverges from the training loss. This will break when the validation loss is indeed decreasing but is generally not close enough to the training loss. The goal of training a model is to encourage the reduction of validation loss and not the reduction in the gap between training loss and validation loss.

Hence, I would argue that a better early stopping criterion would be to watch the trend in the validation loss alone, i.e., if training is no longer lowering the validation loss, terminate it. Here's an example implementation:

import numpy as np

class EarlyStopper:
    def __init__(self, patience=1, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = np.inf

    def early_stop(self, validation_loss):
        if validation_loss < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False

Here's how you'd use it:

early_stopper = EarlyStopper(patience=3, min_delta=10)
for epoch in np.arange(n_epochs):
    train_loss = train_one_epoch(model, train_loader)
    validation_loss = validate_one_epoch(model, validation_loader)
    if early_stopper.early_stop(validation_loss):             
        break
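To see how this criterion behaves, here is a minimal self-contained run of the class above against a hand-made sequence of validation losses (the loss values are invented purely for illustration):

```python
import numpy as np

class EarlyStopper:
    def __init__(self, patience=1, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = np.inf

    def early_stop(self, validation_loss):
        if validation_loss < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False

# Validation loss improves for two epochs, then rises for two epochs in a row.
losses = [5.0, 4.0, 4.2, 4.3, 4.4]
stopper = EarlyStopper(patience=2, min_delta=0)
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.early_stop(loss):
        stopped_at = epoch
        break
print(stopped_at)  # stops at epoch 3, the second consecutive epoch above the minimum
```

Note that the counter only starts climbing once the loss exceeds the best value seen so far, and is reset to zero whenever a new minimum is reached.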
isle_of_gods
    Thank you very much for your answer. It's a new idea and so amazing. So kind of you! – Totoro Oct 05 '22 at 12:49
    Thanks for this solution! I was just wondering why earlier solutions were checking the gap between train and val? That should not be the criterion, should it? Or am I missing something? – Pallavi Nov 07 '22 at 06:23
  • I don't know why, but I believe it could just have resulted out of a cognitive bias from seeing typical training and validation curves shown in texts and blogs, where overfitting is always identified when the validation loss curve starts to deviate from the training loss curve. – isle_of_gods Jan 02 '23 at 12:24
  • Shouldn't the `min_delta` be used in deciding a model's improvement? https://keras.io/api/callbacks/early_stopping/ – Crispy13 Jun 06 '23 at 23:16
  • @Crispy13 that's exactly what `min_delta` here is being used for. See the check ` if validation_loss > (self.min_validation_loss + self.min_delta)`. Or am I missing something here? – isle_of_gods Jun 15 '23 at 10:48
  • @isle_of_gods For example, `self.min_validation_loss`= 0.5, `validation_loss`=0.45 and `min_delta`=0.1, the class will determine the model has improved which was not supposed to be. – Crispy13 Jun 19 '23 at 01:20
  • @Crispy13 if validation loss decreases even a slight amount compared to min validation loss then I'd infer that the training is successfully moving forward. This is because I find it a bit difficult to judge to what extent each training step should decrease the validation loss. I would only add a threshold for the case that the loss doesn't decrease (i.e., stays flat / increases) compared to min validation loss. Technically, one can do both, and wouldn't be wrong. – isle_of_gods Jun 19 '23 at 11:48

The problem with your implementation is that whenever you call early_stopping(), the counter is re-initialized to 0.

Here is a working solution using an object-oriented approach with __call__() and __init__() instead:

class EarlyStopping:
    def __init__(self, tolerance=5, min_delta=0):
        self.tolerance = tolerance
        self.min_delta = min_delta
        self.counter = 0
        self.early_stop = False

    def __call__(self, train_loss, validation_loss):
        if (validation_loss - train_loss) > self.min_delta:
            self.counter +=1
            if self.counter >= self.tolerance:  
                self.early_stop = True

Call it like this:

early_stopping = EarlyStopping(tolerance=5, min_delta=10)

for i in range(epochs):
    
    print(f"Epoch {i+1}")
    epoch_train_loss, pred = train_one_epoch(model, train_dataloader, loss_func, optimiser, device)
    train_loss.append(epoch_train_loss)

    # validation 
    with torch.no_grad(): 
       epoch_validate_loss = validate_one_epoch(model, validate_dataloader, loss_func, device)
       validation_loss.append(epoch_validate_loss)
    
    # early stopping
    early_stopping(epoch_train_loss, epoch_validate_loss)
    if early_stopping.early_stop:
      print("We are at epoch:", i)
      break

Example:

early_stopping = EarlyStopping(tolerance=2, min_delta=5)

train_loss = [
    642.14990234,
    601.29278564,
    561.98400879,
    530.01501465,
    497.1098938,
    466.92709351,
    438.2364502,
    413.76028442,
    391.5090332,
    370.79074097,
]
validate_loss = [
    509.13619995,
    497.3125,
    506.17315674,
    497.68960571,
    505.69918823,
    459.78610229,
    480.25592041,
    418.08630371,
    446.42675781,
    372.09902954,
]

for i in range(len(train_loss)):

    early_stopping(train_loss[i], validate_loss[i])
    print(f"loss: {train_loss[i]} : {validate_loss[i]}")
    if early_stopping.early_stop:
        print("We are at epoch:", i)
        break

Output:

loss: 642.14990234 : 509.13619995
loss: 601.29278564 : 497.3125
loss: 561.98400879 : 506.17315674
loss: 530.01501465 : 497.68960571
loss: 497.1098938 : 505.69918823
loss: 466.92709351 : 459.78610229
loss: 438.2364502 : 480.25592041
We are at epoch: 6
KarelZe
    Thank you very much for your answer. It is more elegant to write it this way. But it doesn't work either! :( P.S. I did a minor edit to your code: self.counter +=1 and self.counter >= self.tolerance – Totoro Apr 25 '22 at 13:11
  • Alright. Great to hear. Didn't run the code previously due to the missing model. Do you need further assistance? – KarelZe Apr 25 '22 at 13:38
  • So you have no idea why this early stopping class cannot stop the training even though the model obviously overfit? I'm actually a bit in a deadlock. – Totoro Apr 25 '22 at 14:07
  • I'd like to dwell on it later. Could you please provide a printout of your validation and test loss to get a feeling for your loss? – KarelZe Apr 25 '22 at 14:25
    Yes, of course. – Totoro Apr 25 '22 at 14:50
    @ Totoro. Thanks. Happy to look into it. – KarelZe Apr 25 '22 at 20:16
  • I edited my post by adding the train and validation loss. Around epoch 50 we can see that the validation loss is increasing. Here, I had tolerance=2 and min_delta=5. It should have ended the training but it continued till the last epoch. – Totoro Apr 26 '22 at 08:32
  • I believe that the problem comes from how I use the instance of the class in my training loop. But I don't exactly know how to fix it :( – Totoro Apr 26 '22 at 10:34
    @Totoro Please provide print outs as text next time. I added an example. Given the sample losses you provided, training is stopped early. Not sure how or where you added it. – KarelZe Apr 26 '22 at 10:43
    Thank you so much. I will provide the data as you said next time. I didn't know. – Totoro Apr 27 '22 at 08:24

In case it helps someone like myself, I would like to build upon the previous answers.

The two answers provided interpret the min_delta parameter differently. In @KarelZe's answer, min_delta is used as the gap between train_loss and validation_loss:

if (validation_loss - train_loss) > self.min_delta:
    self.counter += 1

On the other hand, in @isle_of_gods' answer, min_delta is used to increment the counter when the new validation loss is at least min_delta greater than the current minimum validation loss:

elif validation_loss > (self.min_validation_loss + self.min_delta):
    self.counter += 1

Although neither of these answers is wrong, since it depends on one's needs, I think it is more intuitive to consider min_delta as the minimum change required to consider the model as improving. The documentation for Keras, which is about as popular as PyTorch, defines the min_delta parameter of its early stopping mechanism as follows:

min_delta: Minimum change in the monitored quantity to qualify as an improvement, i.e. an absolute change of less than min_delta, will count as no improvement.

That means any decrease in the validation loss will not be counted as an improvement unless it is larger than min_delta.

To align with the Keras documentation, @isle_of_gods' code can be modified as follows:

import numpy as np

class ValidationLossEarlyStopping:
    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience  # number of epochs without improvement to allow before stopping
        self.min_delta = min_delta  # the minimum change to be counted as an improvement
        self.counter = 0  # counts the number of epochs where validation loss did not improve
        self.min_validation_loss = np.inf

    # returns True after _patience_ consecutive epochs without sufficient improvement
    def early_stop_check(self, validation_loss):
        if (validation_loss + self.min_delta) < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0  # reset the counter if validation loss decreased by at least min_delta
        elif (validation_loss + self.min_delta) > self.min_validation_loss:
            self.counter += 1  # increment the counter if validation loss did not decrease by min_delta
            if self.counter >= self.patience:
                return True
        return False
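As a quick sanity check of the Keras-style semantics, the sketch below feeds the class a sequence of losses that keep decreasing, but each time by less than min_delta. Under this interpretation the counter still grows and training stops, whereas under the gap-to-minimum interpretation from the earlier answer it would not (the loss values are invented purely for illustration):

```python
import numpy as np

class ValidationLossEarlyStopping:
    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = np.inf

    def early_stop_check(self, validation_loss):
        if (validation_loss + self.min_delta) < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif (validation_loss + self.min_delta) > self.min_validation_loss:
            self.counter += 1
            if self.counter >= self.patience:
                return True
        return False

# Each epoch improves the loss by only 0.01, which is less than min_delta=0.1,
# so every epoch after the first counts as "no improvement".
stopper = ValidationLossEarlyStopping(patience=2, min_delta=0.1)
results = [stopper.early_stop_check(loss) for loss in [0.50, 0.49, 0.48, 0.47]]
print(results)  # [False, False, True, True]
```

In a real training loop you would break on the first True; the list here just makes the per-epoch decisions visible.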
Nawras