
I'm using Keras as a submodule of TensorFlow v2 (tf.keras). I'm training my model using the fit_generator() method, and I want to save my model every 10 epochs. How can I achieve this?

In standalone Keras (not as a submodule of tf), I can pass ModelCheckpoint(model_savepath, period=10). But in TF v2 this has changed to ModelCheckpoint(model_savepath, save_freq), where save_freq can be 'epoch', in which case the model is saved every epoch. If save_freq is an integer, the model is saved after that many samples have been processed. But I want it saved after every 10 epochs. How can I achieve this?

Nagabhushan S N

4 Answers


With tf.keras.callbacks.ModelCheckpoint, use save_freq='epoch' and pass the extra argument period=10.

Although this is not in the official docs, that is the way to do it (the docs do mention that you can pass period; they just don't explain what it does).
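For example, a minimal sketch (checkpoint_cb is an illustrative name, model_savepath is your own checkpoint path, and since period is undocumented, this may break on newer TF versions):

import tensorflow as tf

# 'period' is undocumented in tf.keras and was later deprecated, so this
# may stop working on newer TF releases.
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    model_savepath,      # e.g. 'checkpoints/model-{epoch:02d}.h5'
    save_freq='epoch',   # evaluate the saving condition at epoch boundaries
    period=10)           # only save every 10th epoch

Pass it to fit_generator() (or fit()) via the callbacks argument.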

bluesummers
  • I get the below warning: `WARNING:tensorflow:'period' argument is deprecated. Please use 'save_freq' to specify the frequency in number of samples seen.` So I guess this feature is going away soon. In that case, how can I achieve this? – Nagabhushan S N Nov 27 '19 at 12:56
  • I believe the only alternative is to calculate the number of examples per epoch and pass that, times the number of epochs you want between saves, as the integer for `save_freq` – bluesummers Nov 27 '19 at 13:56
  • @bluesummers "examples per epoch" This should be my batch size, right? – Tom Dec 20 '19 at 13:19
  • Examples per epoch is how many *samples* you want to pass through the network between checkpoints - this means if you have 100 samples (samples != batch, batch is a batch of samples) and you put 400, it will save every 4 epochs – bluesummers Dec 20 '19 at 14:02
  • I had the same question as asked by @NagabhushanSN. I calculated the number of samples per epoch to work out the number of samples after which I want to save the model, but it does not seem to work. Batch size = 64; for the test case I am using 10 steps per epoch. If I want to save the model every 3 epochs, the number of samples is 64*10*3 = 1920. I use that for `save_freq`, but the output shows that the model is saved on epoch 1, epoch 2, epoch 9, epoch 11, epoch 14 and is still running. I can't make sense of it. The `period` option seems to work fine, but with the message that it will be deprecated. – beeprogrammer Feb 01 '20 at 00:09
  • For me, `save_freq='epoch', period=10` doesn't work, i.e., after 10 epochs nothing gets saved. Looking at [the documentation](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint), when `save_freq` is given an integer, the callback saves the model at the end of that many batches. So it should be: the number of epochs times the number of batches per epoch (and not times the number of examples per epoch, as @bluesummers said). Correct me if I'm wrong. – nim.py Jul 16 '20 at 09:24
  • @nim.py interesting - what is the data format you are using? Is it a `tf.data.Dataset`? Numpy arrays? A generator? – bluesummers Jul 16 '20 at 12:00
  • @bluesummers I am using a numpy array, on TensorFlow version 2.2.0, Keras 2.4.3. Maybe it doesn't work because `period` is deprecated? – nim.py Jul 16 '20 at 13:48
  • @nim.py yes, the doc says that the model is saved after `save_freq` number of batches. However, with a generator of `batch_size = 16`, `steps_per_epoch = 100` and `save_freq = 20`, I expected to see checkpoints corresponding to the 0th, 20th, 40th, ... batches, but after running the code I see checkpoints for every 2nd batch, i.e. (1, 3, 5, 7, ...). I really can't make any sense of how this works. – Rajdeep Dutta Aug 20 '20 at 18:54

Explicitly computing the number of batches per epoch worked for me.

import tensorflow as tf

BATCH_SIZE = 20
STEPS_PER_EPOCH = train_labels.size // BATCH_SIZE  # whole batches per epoch
SAVE_PERIOD = 10  # save every 10 epochs

# Create a callback that saves the model's weights every 10 epochs.
# save_freq is measured in batches, so convert epochs to batches.
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=SAVE_PERIOD * STEPS_PER_EPOCH)

# Train the model with the new callback
model.fit(train_images, 
          train_labels,
          batch_size=BATCH_SIZE,
          steps_per_epoch=STEPS_PER_EPOCH,
          epochs=50, 
          callbacks=[cp_callback],
          validation_data=(test_images, test_labels),
          verbose=0)
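
Since recent TF versions measure save_freq in batches rather than samples, SAVE_PERIOD * STEPS_PER_EPOCH is exactly the number of batches in 10 epochs, which is why the checkpoint fires on every 10th epoch boundary. If the steps per epoch ever change (e.g., the dataset grows), this value has to be recomputed, or the saves will drift away from epoch boundaries.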

The period param mentioned in the accepted answer is no longer available.

Using the save_freq param is an alternative, but risky, as mentioned in the docs: if the dataset size changes, the batch-based interval may drift away from epoch boundaries, and "if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable" (again taken from the docs).

Thus, I use a subclass as a solution:

import tensorflow as tf

class EpochModelCheckpoint(tf.keras.callbacks.ModelCheckpoint):

    def __init__(self,
                 filepath,
                 frequency=1,
                 monitor='val_loss',
                 verbose=0,
                 save_best_only=False,
                 save_weights_only=False,
                 mode='auto',
                 options=None,
                 **kwargs):
        # Delegate everything except the saving frequency to ModelCheckpoint,
        # forcing save_freq="epoch" so the parent never saves mid-epoch.
        super(EpochModelCheckpoint, self).__init__(filepath, monitor, verbose, save_best_only, save_weights_only,
                                                   mode, "epoch", options)
        self.epochs_since_last_save = 0
        self.frequency = frequency

    def on_epoch_end(self, epoch, logs=None):
        self.epochs_since_last_save += 1
        # pylint: disable=protected-access
        if self.epochs_since_last_save % self.frequency == 0:
            self._save_model(epoch=epoch, batch=None, logs=logs)

    def on_train_batch_end(self, batch, logs=None):
        # Disable the parent's batch-level bookkeeping; saving is epoch-driven.
        pass

Use it as:

callbacks=[
     EpochModelCheckpoint("/your_save_location/epoch{epoch:02d}", frequency=10),
]

Note that, depending on your TF version, you may have to change the arguments in the call to the superclass's __init__, since the positional parameter order of ModelCheckpoint has changed across releases.
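
If relying on the private _save_model method feels fragile, here is a minimal sketch that uses only the public Callback API (PeriodicSaver is an illustrative name; it assumes you only need periodic weight snapshots, without save_best_only tracking):

import tensorflow as tf

class PeriodicSaver(tf.keras.callbacks.Callback):
    """Save the model's weights every `period` epochs via public APIs only."""

    def __init__(self, filepath, period=10):
        super().__init__()
        self.filepath = filepath  # may contain an {epoch:02d} placeholder
        self.period = period

    def on_epoch_end(self, epoch, logs=None):
        # `epoch` is zero-based, so shift by one before testing the interval.
        if (epoch + 1) % self.period == 0:
            self.model.save_weights(self.filepath.format(epoch=epoch + 1))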

miwe

I came here looking for this answer too and wanted to point out a couple of changes from previous answers. I am currently using TF version 2.5.0, and period= is working, but only if there is no save_freq= in the callback.

from tensorflow import keras

my_callbacks = [
    keras.callbacks.ModelCheckpoint(
        filepath=path,
        period=N  # save every N epochs
    )
]

This works for me with no issues, even though period does not appear in the callback documentation.

  • Thanks for the update. Hasn't it been removed yet? It was marked as deprecated and I would imagine it would be removed by now. Is it still deprecated? – Nagabhushan S N Jul 18 '22 at 08:20
  • As of TF v2.5.0 it's still there and working. It is still shown as deprecated: `WARNING:tensorflow:'period' argument is deprecated. Please use 'save_freq' to specify the frequency in number of batches seen.` However, it is still saving as it should. – Andrew Crouch Jul 18 '22 at 09:24