25

I'm writing a custom training loop using the code provided in the Tensorflow DCGAN implementation guide. I wanted to add callbacks in the training loop. In Keras I know we pass them as an argument to the 'fit' method, but can't find resources on how to use these callbacks in the custom training loop. I'm adding the code for the custom training loop from the Tensorflow documentation:

# Notice the use of `tf.function`
# This annotation causes the function to be "compiled".
@tf.function
def train_step(images):
    noise = tf.random.normal([BATCH_SIZE, noise_dim])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
      generated_images = generator(noise, training=True)

      real_output = discriminator(images, training=True)
      fake_output = discriminator(generated_images, training=True)

      gen_loss = generator_loss(fake_output)
      disc_loss = discriminator_loss(real_output, fake_output)

    gradients_of_generator = gen_tape.gradient(gen_loss, generator.trainable_variables)
    gradients_of_discriminator = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gradients_of_generator, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(gradients_of_discriminator, discriminator.trainable_variables))

def train(dataset, epochs):
  for epoch in range(epochs):
    start = time.time()

    for image_batch in dataset:
      train_step(image_batch)

    # Produce images for the GIF as we go
    display.clear_output(wait=True)
    generate_and_save_images(generator,
                             epoch + 1,
                             seed)

    # Save the model every 15 epochs
    if (epoch + 1) % 15 == 0:
      checkpoint.save(file_prefix = checkpoint_prefix)

    print ('Time for epoch {} is {} sec'.format(epoch + 1, time.time()-start))

  # Generate after the final epoch
  display.clear_output(wait=True)
  generate_and_save_images(generator,
                           epochs,
                           seed)
Nicolas Gervais
  • 33,817
  • 13
  • 115
  • 143
Umair Khawaja
  • 403
  • 1
  • 4
  • 9

6 Answers6

14

I've had this problem myself: (1) I want to use a custom training loop; (2) I don't want to lose the bells and whistles Keras gives me in terms of callbacks; (3) I don't want to re-implement them all myself. Tensorflow has a design philosophy of allowing a developer to gradually opt-in to its more low-level APIs. As @HyeonPhilYoun notes in his comment below, the official documentation for tf.keras.callbacks.Callback gives an example of what we're looking for.

The following has worked for me, but can be improved by reverse engineering tf.keras.Model.

The trick is to use tf.keras.callbacks.CallbackList and then manually trigger its lifecycle events from within your custom training loop. This example uses tqdm to give attractive progress bars, but CallbackList has a progress_bar initialization argument that can let you use the defaults. training_model is a typical instance of tf.keras.Model.

from tqdm.notebook import tqdm, trange

# Populate with typical keras callbacks
_callbacks = []

callbacks = tf.keras.callbacks.CallbackList(
    _callbacks, add_history=True, model=training_model)

logs = {}
callbacks.on_train_begin(logs=logs)

# Presentation
epochs = trange(
    max_epochs,
    desc="Epoch",
    unit="Epoch",
    postfix="loss = {loss:.4f}, accuracy = {accuracy:.4f}")
epochs.set_postfix(loss=0, accuracy=0)

# Get a stable test set so epoch results are comparable
test_batches = batches(test_x, test_Y)

for epoch in epochs:
    callbacks.on_epoch_begin(epoch, logs=logs)

    # I like to formulate new batches each epoch
    # if there are data augmentation methods in play
    training_batches = batches(x, Y)

    # Presentation
    enumerated_batches = tqdm(
        enumerate(training_batches),
        desc="Batch",
        unit="batch",
        postfix="loss = {loss:.4f}, accuracy = {accuracy:.4f}",
        position=1,
        leave=False)

    for (batch, (x, y)) in enumerated_batches:
        training_model.reset_states()
        
        callbacks.on_batch_begin(batch, logs=logs)
        callbacks.on_train_batch_begin(batch, logs=logs)
        
        logs = training_model.train_on_batch(x=x, y=Y, return_dict=True)

        callbacks.on_train_batch_end(batch, logs=logs)
        callbacks.on_batch_end(batch, logs=logs)

        # Presentation
        enumerated_batches.set_postfix(
            loss=float(logs["loss"]),
            accuracy=float(logs["accuracy"]))

    for (batch, (x, y)) in enumerate(test_batches):
        training_model.reset_states()

        callbacks.on_batch_begin(batch, logs=logs)
        callbacks.on_test_batch_begin(batch, logs=logs)

        logs = training_model.test_on_batch(x=x, y=Y, return_dict=True)

        callbacks.on_test_batch_end(batch, logs=logs)
        callbacks.on_batch_end(batch, logs=logs)

    # Presentation
    epochs.set_postfix(
        loss=float(logs["loss"]),
        accuracy=float(logs["accuracy"]))

    callbacks.on_epoch_end(epoch, logs=logs)

    # NOTE: This is a decent place to check on your early stopping
    # callback.
    # Example: use training_model.stop_training to check for early stopping


callbacks.on_train_end(logs=logs)

# Fetch the history object we normally get from keras.fit
history_object = None
for cb in callbacks:
    if isinstance(cb, tf.keras.callbacks.History):
        history_object = cb
assert history_object is not None
Rob Hall
  • 2,693
  • 16
  • 22
  • 3
    Thank you for this extensive answer! This was really helpful for me. Annoying that there isn't more official documentation regarding callbacks in a custom loop! – emil Jun 12 '21 at 15:54
  • 1
    An official documentation also notes that this way is the most appropriate one. You can check in example section. https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/Callback – HyeonPhil Youn Sep 26 '21 at 16:31
  • 1
    This should be accepted as the official answer.. great work! – Rayyan Mar 25 '22 at 14:03
  • 1
    This is great. Tried it out with Tensorboard callbacks and it worked like a charm. So in my case it looked like this: ''' tensorboard_callback = keras.callbacks.TensorBoard( log_dir='./callbacks/tensorboard', histogram_freq=1) _callbacks = [tensorboard_callback] callbacks = keras.callbacks.CallbackList( _callbacks, add_history=True, model=encoder) logs_ae = {} callbacks.on_train_begin(logs=logs_ae) ... ... ''' – UlrikP Apr 11 '22 at 08:51
5

The simplest way would be to check if the loss has changed over your expected period and break or manipulate the training process if not. Here is one way you could implement a custom early stopping callback :

def Callback_EarlyStopping(LossList, min_delta=0.1, patience=20):
    #No early stopping for 2*patience epochs 
    if len(LossList)//patience < 2 :
        return False
    #Mean loss for last patience epochs and second-last patience epochs
    mean_previous = np.mean(LossList[::-1][patience:2*patience]) #second-last
    mean_recent = np.mean(LossList[::-1][:patience]) #last
    #you can use relative or absolute change
    delta_abs = np.abs(mean_recent - mean_previous) #abs change
    delta_abs = np.abs(delta_abs / mean_previous)  # relative change
    if delta_abs < min_delta :
        print("*CB_ES* Loss didn't change much from last %d epochs"%(patience))
        print("*CB_ES* Percent change in loss value:", delta_abs*1e2)
        return True
    else:
        return False

This Callback_EarlyStopping checks your metrics/loss every epoch and returns True if the relative change is less than what you expected by computing moving average of losses after every patience number of epochs. You can then capture this True signal and break the training loop. To completely answer your question, within your sample training loop you can use this as:

gen_loss_seq = []
for epoch in range(epochs):
  #in your example, make sure your train_step returns gen_loss
  gen_loss = train_step(dataset) 
  #ideally, you can have a validation_step and get gen_valid_loss
  gen_loss_seq.append(gen_loss)  
  #check every 20 epochs and stop if gen_valid_loss doesn't change by 10%
  stopEarly = Callback_EarlyStopping(gen_loss_seq, min_delta=0.1, patience=20)
  if stopEarly:
    print("Callback_EarlyStopping signal received at epoch= %d/%d"%(epoch,epochs))
    print("Terminating training ")
    break
       

Of course, you can increase the complexity in numerous ways, for example, which loss or metrics you would like to track, your interest in the loss at a particular epoch or moving average of loss, your interest in relative or absolute change in value, etc. You can refer to Tensorflow 2.x implementation of tf.keras.callbacks.EarlyStopping here which is generally used in the popular tf.keras.Model.fit method.

Aakash Patil
  • 159
  • 3
  • 7
  • 5
    Unfortunately, this answer applies only to the very specific case, where one wants to use the EarlyStopping callback. However, there a much more other useful callbacks that one might like to reuse and not implement from scratch. – Stanley F. Feb 11 '21 at 12:19
2

I think you would need to implement the functionality of the callback manually. It should not be too difficult. You could for instance have the "train_step" function return the losses and then implement functionality of callbacks such as early stopping in your "train" function. For callbacks such as learning rate schedule the function tf.keras.backend.set_value(generator_optimizer.lr,new_lr) would come in handy. Therefore the functionality of the callback would be implemented in your "train" function.

Paul Mwaniki
  • 351
  • 3
  • 7
2

A custom training loop is just a normal Python loop, so you can use if statements to break the loop whenever some condition is met. For instance:

if len(loss_history) > patience:
    if loss_history.popleft()*delta < min(loss_history):
        print(f'\nEarly stopping. No improvement of more than {delta:.5%} in '
              f'validation loss in the last {patience} epochs.')
        break

If there is no improvement of delta% in the loss in the past patience epochs, the loop will be broken. Here, I'm using a collections.deque, which can easily be used as a rolling list that keeps in memory information only the last patience epochs.

Here's a full implementation, with the documentation example from the Tensorflow documentation:

patience = 3
delta = 0.001

loss_history = deque(maxlen=patience + 1)

for epoch in range(1, 25 + 1):
    train_loss = tf.metrics.Mean()
    train_acc = tf.metrics.CategoricalAccuracy()
    test_loss = tf.metrics.Mean()
    test_acc = tf.metrics.CategoricalAccuracy()

    for x, y in train:
        loss_value, grads = get_grad(model, x, y)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        train_loss.update_state(loss_value)
        train_acc.update_state(y, model(x, training=True))

    for x, y in test:
        loss_value, _ = get_grad(model, x, y)
        test_loss.update_state(loss_value)
        test_acc.update_state(y, model(x, training=False))

    print(verbose.format(epoch,
                         train_loss.result(),
                         test_loss.result(),
                         train_acc.result(),
                         test_acc.result()))

    loss_history.append(test_loss.result())

    if len(loss_history) > patience:
        if loss_history.popleft()*delta < min(loss_history):
            print(f'\nEarly stopping. No improvement of more than {delta:.5%} in '
                  f'validation loss in the last {patience} epochs.')
            break
Epoch  1 Loss: 0.191 TLoss: 0.282 Acc: 68.920% TAcc: 89.200%
Epoch  2 Loss: 0.157 TLoss: 0.297 Acc: 70.880% TAcc: 90.000%
Epoch  3 Loss: 0.133 TLoss: 0.318 Acc: 71.560% TAcc: 90.800%
Epoch  4 Loss: 0.117 TLoss: 0.299 Acc: 71.960% TAcc: 90.800%

Early stopping. No improvement of more than 0.10000% in validation loss in the last 3 epochs.
Nicolas Gervais
  • 33,817
  • 13
  • 115
  • 143
2

aapa3e8's answer is correct but I am providing an implementation of Callback_EarlyStopping below that is more similar to tf.keras.callbacks.EarlyStopping

def Callback_EarlyStopping(MetricList, min_delta=0.1, patience=20, mode='min'):
    #No early stopping for the first patience epochs 
    if len(MetricList) <= patience:
        return False
    
    min_delta = abs(min_delta)
    if mode == 'min':
      min_delta *= -1
    else:
      min_delta *= 1
    
    #last patience epochs 
    last_patience_epochs = [x + min_delta for x in MetricList[::-1][1:patience + 1]]
    current_metric = MetricList[::-1][0]
    
    if mode == 'min':
        if current_metric >= max(last_patience_epochs):
            print(f'Metric did not decrease for the last {patience} epochs.')
            return True
        else:
            return False
    else:
        if current_metric <= min(last_patience_epochs):
            print(f'Metric did not increase for the last {patience} epochs.')
            return True
        else:
            return False
Matthew Thomas
  • 594
  • 7
  • 13
  • 3
    Unfortunately, the question is not about early stopping but about callbacks in general. Why does everyone here believes, the questioner want only this specific callback? – Stanley F. Feb 11 '21 at 12:21
0

I tested @Rob Hall's method with tensorboard callbacks and it did indeed work. So in my case it looked like this:

'''

tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir='./callbacks/tensorboard',
    histogram_freq=1)

_callbacks = [tensorboard_callback]
callbacks = keras.callbacks.CallbackList(
    _callbacks, add_history=True, model=encoder)

    logs_ae = {}
    callbacks.on_train_begin(logs=logs_ae)
...
...

'''

UlrikP
  • 412
  • 3
  • 8