I trained a text classification model consisting of an RNN in TensorFlow 2.0 with the Keras API. I trained this model on multiple GPUs (2) using tf.distribute.MirroredStrategy()
from here. I saved a checkpoint of the model after every epoch using tf.keras.callbacks.ModelCheckpoint('file_name.h5').
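For reference, the training setup looks roughly like this (a simplified sketch; the model architecture, data shapes, and file name are placeholders, not my real code):

```python
import numpy as np
import tensorflow as tf

# Distribute training across all visible GPUs (falls back to CPU if none).
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    # Stand-in for the real RNN text classifier.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 16),
        tf.keras.layers.SimpleRNN(16),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# Save the full model to HDF5 after every epoch.
checkpoint = tf.keras.callbacks.ModelCheckpoint('file_name.h5')

# Placeholder data: 32 sequences of 10 token ids.
x = np.random.randint(0, 1000, size=(32, 10))
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, callbacks=[checkpoint], verbose=0)
```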
Now I want to continue training from the last saved checkpoint, on the same number of GPUs. But after loading the checkpoint inside the tf.distribute.MirroredStrategy() scope like this:

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model('file_name.h5')

it throws the following error:
  File "model_with_tfsplit.py", line 94, in <module>
    model =tf.keras.models.load_model('TF_model_onfull_2_03.h5') # Loading for retraining
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/save.py", line 138, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
    model._make_train_function()
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 2015, in _make_train_function
    params=self._collected_trainable_weights, loss=self.total_loss)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 500, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
    grads = gradients.gradients(loss, params)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 541, in _GradientsHelper
    for x in xs
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/distribute/values.py", line 716, in handle
    raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call
Now I am not sure where the problem is. Also, if I do not use the mirrored strategy (i.e. single-GPU training), the training starts from the beginning, but after a few steps it reaches the same accuracy and loss values the model had before it was saved. I am not sure whether that behaviour is normal.
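To be concrete, the strategy-free path that does work for me is just loading the checkpoint outside any scope. A minimal self-contained sketch of that path (the saved model here is a trivial placeholder standing in for my actual checkpoint):

```python
import numpy as np
import tensorflow as tf

# Create and save a trivial model to stand in for the real checkpoint file.
m = tf.keras.Sequential([tf.keras.layers.Dense(1)])
m.build((None, 4))
m.compile(optimizer='adam', loss='mse')
m.save('ckpt_demo.h5')

# Load the checkpoint outside any strategy scope (single-device training)
# and continue fitting; this path does not raise the `handle` error.
model = tf.keras.models.load_model('ckpt_demo.h5')
x = np.random.rand(8, 4).astype('float32')
y = np.random.rand(8, 1).astype('float32')
model.fit(x, y, epochs=1, verbose=0)
```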
Thank you! Rishabh Sahrawat