
The TensorFlow 2 documentation states that users can save a TensorFlow Keras model by calling model.save() with either the "SavedModel" or "h5" format (latest version 2.4.1: https://www.tensorflow.org/guide/keras/save_and_serialize#whole-model_saving_loading). Assuming the "SavedModel" format is used, I am wondering whether it is by design to periodically save checkpoints in the "SavedModel" format. For example,

import numpy as np
import tensorflow as tf
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense


def get_model() -> Model:
    """
    Define a TF Keras Model with layers having loss associated.
    """
    x_in = Input(shape=(4,), name="input")
    layer1 = Dense(64, name="l1")(x_in)
    layer2 = Dense(64, name="l2")(layer1)
    x_out = Dense(2, name="output")(layer2)
    model = Model(inputs=x_in, outputs=x_out, name="m")
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=[])
    return model


def train_step(input_data: np.ndarray, label_data: np.ndarray, model: Model) -> None:
    """
    Perform the training steps for the built Keras model, and periodically save
    a "SavedModel" checkpoint -- Is it desired?
    """
    # Perform a training step with input and label data.
    with tf.GradientTape() as tape:
        # A simple pass to mock the training step.
        pass
    # At the end of each training step, we save the updated model.
    # But... is "SavedModel" the desired format for periodic saving checkpoints?
    model.save("/tmp/saved_model/", save_format="tf")


def main() -> None:
    model = get_model()
    # Train for 100 epochs
    for epoch in range(100):
        # each for inputs, labels in train_data:
        train_step(input_data=np.array([1.0]), label_data=np.array([1.0]), model=model)
        print(f"Epoch {epoch}, the training metric is...")


main()

I'm asking because my understanding is that "SavedModel" is designed for saving a model only when it is "ready" for deployment for inference (i.e. the model is well trained), and users save the "SavedModel" only once (or O(1) times), which usually happens at the end. One piece of evidence for this is that in TensorFlow 1, when using tf.Session directly with a TensorFlow graph, periodically saving the "SavedModel" as in the code below leaks one "Saver" node into the TensorFlow graph every time the builder is created:

import tensorflow as tf
from tensorflow import saved_model


def _save_to_saved_model(input_tensor: tf.Tensor, output_tensor: tf.Tensor, tf_session: tf.Session, saved_model_path: str) -> None:
    """
    Save the Tensorflow graph to the Tensorflow saved model.
    """
    # Create the saved model builder.
    builder = saved_model.builder.SavedModelBuilder(saved_model_path)

    # Build the tensor info proto using the tensors.
    tensor_info_obs = saved_model.utils.build_tensor_info(input_tensor)
    tensor_info_output = saved_model.utils.build_tensor_info(output_tensor)

    # Get the default method name.
    method_name = saved_model.signature_constants.PREDICT_METHOD_NAME
    policy_signature = (
        saved_model.signature_def_utils.build_signature_def(
            inputs={"input": tensor_info_obs},
            outputs={"output": tensor_info_output},
            method_name=method_name))

    # Get the signature def map key.
    serving_signature_key = (
        saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY)
    builder.add_meta_graph_and_variables(
        tf_session, [saved_model.tag_constants.SERVING],
        signature_def_map={serving_signature_key: policy_signature})

    # Save the saved model.
    builder.save()


def main() -> None:
    for _ in range(100):
        # Mock each train step by only saving the "SavedModel".
        _save_to_saved_model(...)


main()
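
To make the leak observable, here is a minimal sketch (TensorFlow 1.x style, matching the snippet above; the layer sizes and /tmp paths are only illustrative) that counts the operations in the default graph after each save:

import tensorflow as tf
from tensorflow import saved_model


def count_saver_leak() -> None:
    x = tf.placeholder(tf.float32, shape=(None, 4), name="input")
    y = tf.layers.dense(x, 2, name="output")
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for i in range(3):
            # SavedModelBuilder refuses to overwrite an existing directory,
            # so each save goes to its own path.
            builder = saved_model.builder.SavedModelBuilder(f"/tmp/sm_leak_{i}")
            builder.add_meta_graph_and_variables(
                sess, [saved_model.tag_constants.SERVING])
            builder.save()
            # The count grows on every iteration because each builder adds
            # its own Saver nodes to the default graph.
            print(i, len(tf.get_default_graph().get_operations()))


count_saver_leak()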

Another question posted a few years ago seems to mention the same point: How to periodically save tensorflow model using saved_model API?. However, TensorFlow Keras doesn't seem to have the same leaking issue, since Model.save() appears to create a TrackableSaver rather than a Saver, which doesn't leak saver nodes into the TensorFlow graph. Still, I want to know whether, with a TensorFlow Keras Model, it is desired/intended to periodically save checkpoints in the "SavedModel" format.

NOTE: The "SavedModel" format is being considered because it appears to be the only Keras model persistence format that allows restoring a model without access to the custom model code.
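
For example, a minimal sketch (assuming the model above has already been saved to /tmp/saved_model/); the restore does not need access to the original model-building code:

import tensorflow as tf

# Restore the SavedModel in a fresh process; no custom_objects or original
# model definition is needed to load and run it.
restored = tf.keras.models.load_model("/tmp/saved_model/")
print(restored.predict(tf.zeros((1, 4))))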

Thanks!

Ruofan Kong
  • The interval at which to save the model depends on the need. On the other hand, saving the model at each training step feels unnecessary unless there is some special condition. If I understand you correctly, are you wondering whether periodically saving the Keras model is good or bad? – Innat Mar 29 '21 at 08:37
  • The target format is a matter of preference and goals. From my experience .h5 files are faster to save so I usually stick to those. Nothing stops you from converting the chosen .h5 file to TF Saved Model format later on. – sebastian-sz Mar 29 '21 at 11:54
  • @M.Innat Perhaps my question is not super clear, but to clarify: I'm more curious about whether the "SavedModel" format is designed for "periodic" checkpoint saving. Also, to your point that "saving the model at each training step feels unnecessary unless there is some special condition": it really depends on the use case; especially when the training cycle is long and you need to check the performance of historical checkpoints "periodically", it becomes more critical. – Ruofan Kong Mar 29 '21 at 17:15
  • @sebastian-sz There's a core difference between these two formats. From the TF doc: "The key difference between HDF5 and SavedModel is that HDF5 uses object configs to save the model architecture, while SavedModel saves the execution graph. Thus, SavedModels are able to save custom objects like subclassed models and custom layers without requiring the original code." The _required_ condition to convert ".h5" to "SavedModel" is that you must have access to the original custom layer code to do the conversion. – Ruofan Kong Mar 29 '21 at 17:49

1 Answer


"""Perform the training steps for the built Keras model, and periodically save a "SavedModel" checkpoint -- Is it desired? """

No, it's not desired. When you wrote:

def train_step(input_data, label_data, model):
    # Perform a training step with input and label data.
    with tf.GradientTape() as tape:
        # A simple pass to mock the training step.
        pass
    model.save("/tmp/saved_model/", save_format="tf")

for epoch in range(100):
    # each for inputs, labels in train_data:
    train_step(input_data=np.array([1.0]), label_data=np.array([1.0]), model=model)
    print(f"Epoch {epoch}, the training metric is...")

First of all, looping over train_step in this way is somewhat wrong. It should look more like this:

# `model`, `loss_fn`, `optimizer`, and `dataset` are assumed to be defined.
@tf.function
def train_step(x, y):
    # start the scope of the gradient tape
    with tf.GradientTape() as tape:
        logits = model(x, training=True)      # forward pass
        train_loss_value = loss_fn(y, logits) # compute loss
    # compute gradients and update the trainable parameters
    grads = tape.gradient(train_loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))

for epoch in range(100):
    for img, label in dataset:
        # each for inputs, labels in train_data:
        train_step(img, label)

Here the dataset yields a single batch at a time, and that batch is passed to the model. Now, if we save the model at each training step, we are saving the whole model for every batch, which normally doesn't make much sense (unless there is some special condition), because meaningful progress is usually measured at the end of an epoch, after the trainable parameters have been updated over the whole dataset. So saving the model in the middle of an epoch is generally not desired.
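
If a custom training loop is still preferred, a minimal sketch (just a suggestion; the paths are illustrative) is to move the save out of train_step and do it once per epoch:

for epoch in range(100):
    for img, label in dataset:
        train_step(img, label)
    # Save once per epoch, after the trainable parameters have been updated
    # over the whole dataset; each epoch could also get its own directory.
    model.save(f"/tmp/saved_model/epoch_{epoch}", save_format="tf")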


Here is something that you may find interesting: a built-in callback to save models in the tf.keras API, ModelCheckpoint.

tf.keras.callbacks.ModelCheckpoint(
    filepath, 
    monitor='val_loss', 
    verbose=0, 
    save_best_only=False,
    save_weights_only=False, 
    mode='auto', 
    save_freq='epoch',
    options=None, **kwargs
)

An example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (_, _) = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype('float32') / 255
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)

inputs = keras.Input(shape=(32, 32, 3), name="img1")

x = layers.Conv2D(16, 3, activation="relu")(inputs)
x = layers.Conv2D(32, 3, activation="relu")(x)
x = layers.GlobalMaxPooling2D()(x)
out = layers.Dense(10, activation='softmax')(x)
encoder = keras.Model(inputs=inputs, outputs=out, name="encoder")

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='/content/check.h5',    # .h5 file
    save_weights_only=True,          # save weights only
    monitor='categorical_accuracy',
    mode='max',
    verbose=1,
    save_freq='epoch',               # save at each epoch
    save_best_only=True)

encoder.compile(
          loss = tf.keras.losses.CategoricalCrossentropy(),
          metrics = tf.keras.metrics.CategoricalAccuracy(),
          optimizer = tf.keras.optimizers.Adam())
# fit 
encoder.fit(x_train, y_train, batch_size=128, epochs=2, 
            verbose = 1, callbacks=[model_checkpoint_callback])

This callback is invoked every epoch (save_freq='epoch') and, since save_weights_only=True, it saves only the model weights to the path filepath in .h5 format, subject to the save_best_only/monitor condition. Now, if we want to save the whole model in .h5 format under the same conditions, we simply set save_weights_only=False. Likewise, if we want to save the whole model in the tf (SavedModel) format under the same conditions, we simply set the following; a check folder will be generated containing the assets, variables, and saved_model.pb files.

filepath='/content/check',
save_weights_only=False,

Now, about the save_freq:

'epoch' or integer. When using 'epoch', the callback saves the model after each epoch. When using integer, the callback saves the model at end of this many batches. If the Model is compiled with steps_per_execution=N, then the saving criteria will be checked every Nth batch. Note that if the saving isn't aligned to epochs, the monitored metric may potentially be less reliable (it could reflect as little as 1 batch since the metrics get reset every epoch). Defaults to 'epoch'.

So, if we want to save the whole model (or only the weights) batch-wise, we can do per-batch model saving in the SavedModel format as follows.

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath='/content/check',
    save_weights_only=False,
    monitor='categorical_accuracy',
    mode='max',
    verbose=1,
    save_freq=128,       
    save_best_only=True 
)

encoder.compile(
          loss = tf.keras.losses.CategoricalCrossentropy(),
          metrics = tf.keras.metrics.CategoricalAccuracy(),
          optimizer = tf.keras.optimizers.Adam())
# fit 
encoder.fit(x_train, y_train, batch_size=128, epochs=1, 
            verbose = 1, callbacks=[model_checkpoint_callback])
128/391 [========>.....................] - ETA: 47s - loss: 1.3787 - categorical_accuracy: 0.4987
Epoch 00001: categorical_accuracy improved from -inf to 0.50153, saving model to /content/checks
INFO:tensorflow:Assets written to: /content/checks/assets
256/391 [==================>...........] - ETA: 25s - loss: 1.3727 - categorical_accuracy: 0.5007
Epoch 00001: categorical_accuracy improved from 0.50153 to 0.50479, saving model to /content/checks
INFO:tensorflow:Assets written to: /content/checks/assets
384/391 [============================>.] - ETA: 1s - loss: 1.3690 - categorical_accuracy: 0.5025
Epoch 00001: categorical_accuracy improved from 0.50479 to 0.50757, saving model to /content/checks
INFO:tensorflow:Assets written to: /content/checks/assets
391/391 [==============================] - 76s 192ms/step - loss: 1.3688 - categorical_accuracy: 0.5026
<tensorflow.python.keras.callbacks.History at 0x7f0f0017dd50>
Innat
  • Well, first, the `train_step` code is not complete; it is only meant to show how it relates to the saving, and that's not the key point. Second, when you suggest `ModelCheckpoint` with the `h5` format, the problem is that when you define a _custom_ model, you have to restore it with access to the _original custom model code_, because the format is not custom-model agnostic, while saving & loading with SavedModel doesn't have that limitation, and it looks like the only persistence format that TF supports which works this way, although it is very expensive. And this property is important to me. – Ruofan Kong Apr 02 '21 at 05:04
  • I think you're confused. When I wrote about `ModelCheckpoint`, I started with the statement - **Here is something that you may find interesting....** - It's not the actual answer, just some additional information for you. And saving custom objects may be an issue with this, but nowhere in your question did you mention it. – Innat Apr 02 '21 at 05:17
  • This is why I posted the question here. Without the custom-model concern, I guess everyone would know how to do it, although some other aspects may still be considerations with different saving formats. Given the limitation, it looks like only saved_model satisfies it, while I don't think that's a good way for periodic saving either, given it's super expensive. But apparently there are no other ways yet, and that's why I'm seeking an authoritative answer on whether it is by design to save it periodically or not. Does that make sense? – Ruofan Kong Apr 02 '21 at 05:42
  • No, it doesn't make sense. You should update your question to make it more precise and on point in order to get help from others. – Innat Apr 02 '21 at 05:49
  • I added a note to the question that incorporates the limitation of the other formats. – Ruofan Kong Apr 02 '21 at 06:09