I trained a text classification model consisting of an RNN in TensorFlow 2.0 with the Keras API. I trained this model on multiple GPUs (2) using tf.distribute.MirroredStrategy()
from here. I saved a checkpoint of the model after every epoch using tf.keras.callbacks.ModelCheckpoint('file_name.h5').
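For reference, the training setup looks roughly like this (a simplified sketch; the model architecture, data shapes, and file name are placeholders, not my real code):

```python
import numpy as np
import tensorflow as tf

# Distribute training across all visible GPUs (falls back to CPU if none).
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Stand-in for the real RNN text classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 16),
        tf.keras.layers.SimpleRNN(16),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# Save the full model to HDF5 after every epoch.
checkpoint = tf.keras.callbacks.ModelCheckpoint('file_name.h5')

# Placeholder data: 32 sequences of 10 token ids.
x = np.random.randint(0, 1000, size=(32, 10))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, callbacks=[checkpoint], verbose=0)
```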
Now I want to continue training from the last saved checkpoint, on the same number of GPUs. But after loading the checkpoint inside the tf.distribute.MirroredStrategy() scope like this:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model('file_name.h5')

it throws the following error:
  File "model_with_tfsplit.py", line 94, in <module>
    model =tf.keras.models.load_model('TF_model_onfull_2_03.h5') # Loading for retraining
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/save.py", line 138, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
    model._make_train_function()
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 2015, in _make_train_function
    params=self._collected_trainable_weights, loss=self.total_loss)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 500, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
    grads = gradients.gradients(loss, params)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 541, in _GradientsHelper
    for x in xs
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/distribute/values.py", line 716, in handle
    raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call
Now I am not sure where the problem is. Also, if I do not use the mirrored strategy (i.e. single-GPU training), the training starts from the beginning, but after a few steps it reaches the same accuracy and loss values the model had before it was saved. I am not sure whether that behaviour is normal.
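To be concrete, the strategy-free path that does work for me is just loading the checkpoint outside any scope. A minimal self-contained sketch of that path (the saved model here is a trivial placeholder standing in for my actual checkpoint):

```python
import numpy as np
import tensorflow as tf

# Create and save a trivial model to stand in for the real checkpoint file.
m = tf.keras.Sequential([tf.keras.layers.Dense(1)])
m.build((None, 4))
m.compile(optimizer='adam', loss='mse')
m.save('ckpt_demo.h5')

# Load the checkpoint outside any strategy scope (single-device training)
# and continue fitting; this path does not raise the `handle` error.
model = tf.keras.models.load_model('ckpt_demo.h5')
x = np.random.rand(8, 4).astype('float32')
y = np.random.rand(8, 1).astype('float32')
model.fit(x, y, epochs=1, verbose=0)
```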
Thank you! Rishabh Sahrawat