
I have a model that I've trained for 40 epochs. I kept a checkpoint for each epoch, and I have also saved the model with model.save(). The training code is:

n_units = 1000
model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')
# define the checkpoint
filepath="word2vec-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=40, batch_size=50, callbacks=callbacks_list)

However, when I load the model and try to train it again, it starts all over as if it hadn't been trained before. The loss does not continue from where the last training left off.

What confuses me is that when I redefine the model structure and use load_weights(), model.predict() works well. Thus, I believe the model weights are loaded:

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
filename = "word2vec-39-0.0027.hdf5"
model.load_weights(filename)
model.compile(loss='mean_squared_error', optimizer='adam')

However, when I continue training with this, the loss is as high as it was at the initial stage:

filepath="word2vec-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=40, batch_size=50, callbacks=callbacks_list)

I searched and found some examples of saving and loading models here and here, but none of them work.


Update 1

I looked at this question, tried it and it works:

model.save('partly_trained.h5')
del model
model = load_model('partly_trained.h5')

But when I close Python, reopen it, and run load_model again, it fails. The loss is as high as it was at the initial state.


Update 2

I tried Yu-Yang's example code and it works. However, when I use my own code again, it still fails.

This is the result from the original training. The second epoch should start with a loss around 3.1:

13700/13846 [============================>.] - ETA: 0s - loss: 3.0519
13750/13846 [============================>.] - ETA: 0s - loss: 3.0511
13800/13846 [============================>.] - ETA: 0s - loss: 3.0512Epoch 00000: loss improved from inf to 3.05101, saving model to LPT-00-3.0510.h5

13846/13846 [==============================] - 81s - loss: 3.0510    
Epoch 2/60

   50/13846 [..............................] - ETA: 80s - loss: 3.1754
  100/13846 [..............................] - ETA: 78s - loss: 3.1174
  150/13846 [..............................] - ETA: 78s - loss: 3.0745

I closed Python, reopened it, loaded the model with model = load_model("LPT-00-3.0510.h5"), and then trained with:

filepath="LPT-{epoch:02d}-{loss:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
# fit the model
model.fit(x, y, epochs=60, batch_size=50, callbacks=callbacks_list)

The loss starts at 4.54:

Epoch 1/60
   50/13846 [..............................] - ETA: 162s - loss: 4.5451
   100/13846 [..............................] - ETA: 113s - loss: 4.3835
  • Did you call `model.compile(optimizer='adam')` after `load_model()`? If so, don't do that. Re-compiling the model with the option `optimizer='adam'` will reset the inner state of the optimizer (in fact, a new Adam optimizer instance is created). – Yu-Yang Jul 31 '17 at 13:13
  • Thanks for your answer. But no, I didn't call `model.compile` again. All I did after re-opening Python was `model = load_model('partly_trained.h5')` and `model.fit(x, y, epochs=20, batch_size=100)`. – David Aug 01 '17 at 01:36
  • I also tried redefining the model structure with `model.load_weights('checkpoint.hff5')` and `model.compile(loss='categorical_crossentropy')`, but it gives an error saying an optimizer must be given. – David Aug 01 '17 at 01:40

8 Answers


As it's quite difficult to tell where the problem is, I created a toy example from your code, and it seems to work fine.

import numpy as np
from numpy.testing import assert_allclose
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

# load the model
new_model = load_model(filepath)
assert_allclose(model.predict(x_train),
                new_model.predict(x_train),
                1e-5)

# fit the model
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

The loss continues to decrease after the model is loaded. (Restarting Python also causes no problem.)

Using TensorFlow backend.
Epoch 1/5
500/500 [==============================] - 2s - loss: 0.3216     Epoch 00000: loss improved from inf to 0.32163, saving model to model.h5
Epoch 2/5
500/500 [==============================] - 0s - loss: 0.2923     Epoch 00001: loss improved from 0.32163 to 0.29234, saving model to model.h5
Epoch 3/5
500/500 [==============================] - 0s - loss: 0.2542     Epoch 00002: loss improved from 0.29234 to 0.25415, saving model to model.h5
Epoch 4/5
500/500 [==============================] - 0s - loss: 0.2086     Epoch 00003: loss improved from 0.25415 to 0.20860, saving model to model.h5
Epoch 5/5
500/500 [==============================] - 0s - loss: 0.1725     Epoch 00004: loss improved from 0.20860 to 0.17249, saving model to model.h5

Epoch 1/5
500/500 [==============================] - 0s - loss: 0.1454     Epoch 00000: loss improved from inf to 0.14543, saving model to model.h5
Epoch 2/5
500/500 [==============================] - 0s - loss: 0.1289     Epoch 00001: loss improved from 0.14543 to 0.12892, saving model to model.h5
Epoch 3/5
500/500 [==============================] - 0s - loss: 0.1169     Epoch 00002: loss improved from 0.12892 to 0.11694, saving model to model.h5
Epoch 4/5
500/500 [==============================] - 0s - loss: 0.1097     Epoch 00003: loss improved from 0.11694 to 0.10971, saving model to model.h5
Epoch 5/5
500/500 [==============================] - 0s - loss: 0.1057     Epoch 00004: loss improved from 0.10971 to 0.10570, saving model to model.h5

By the way, redefining the model followed by load_weights() definitely won't work, because save_weights() and load_weights() do not save or load the optimizer state.
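
To make that concrete, here is a minimal sketch contrasting the two approaches. It reuses the toy model, x_train, and y_train from the example above, and the file names are purely illustrative:

# Full save: load_model() restores the architecture, the weights, AND the
# optimizer state, so training picks up where it left off.
model.save('full_model.h5')
resumed = load_model('full_model.h5')
resumed.fit(x_train, y_train, epochs=2, batch_size=50)

# Weights only: load_weights() restores the weights, but compile() creates a
# brand-new Adam instance, so the optimizer's moment estimates and iteration
# count start from scratch.
model.save_weights('weights_only.h5')
rebuilt = Sequential()
rebuilt.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
rebuilt.add(Dropout(0.2))
rebuilt.add(LSTM(n_units, return_sequences=True))
rebuilt.add(Dropout(0.2))
rebuilt.add(LSTM(n_units))
rebuilt.add(Dropout(0.2))
rebuilt.add(Dense(vec_size, activation='linear'))
rebuilt.load_weights('weights_only.h5')
rebuilt.compile(loss='mean_squared_error', optimizer='adam')  # optimizer state is reset here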

– Yu-Yang
  • I tried your toy code, and it works. But moving back to my code, it still fails... I think I'm doing exactly the same as your example; I don't understand why. Please see my update for details. – David Aug 01 '17 at 04:37
  • Just a random guess: are you using the same `(x, y)` before and after model loading? – Yu-Yang Aug 01 '17 at 06:00
  • Yes. I literally closed Python, reopened it, and reloaded the data. – David Aug 01 '17 at 06:08
  • Sorry, Yu-Yang. The problem has nothing to do with Keras. I've figured out why. But your answer does help me with reloading the optimizer. Now I can save the model and continue training! Thanks a lot. – David Aug 02 '17 at 05:10
  • @David So, what was the problem? – Leonid Dashko Mar 25 '18 at 13:12
  • Just to add to your very good answer: `filepath` can contain named formatting options, which will be filled with the value of the epoch and the keys in `logs` (passed in on_epoch_end). For example, if filepath is `weights.{epoch:02d}-{val_loss:.2f}.hdf5`, the model checkpoints will be saved with the epoch number and the validation loss in the filename. You can also keep track of the training history. Sample code here: https://www.kaggle.com/morenoh149/keras-continue-training – Rami Alloush Nov 10 '18 at 00:09
  • @David tell us what was the problem DAVIDDDD – shivam13juna Jan 06 '19 at 05:27
  • There should be a `fitting` object, and that is what should be checkpointed. – user3673 Dec 28 '19 at 20:55
  • Your example continues training from the best model, not from the last model. If you load the last model but use ModelCheckpoint with save_best_only=True, it won't have the data of the best model anymore. You can achieve this by having two ModelCheckpoint callbacks: one for saving the best model and one for saving the last. – Rodrigo Ruiz Apr 10 '20 at 20:38
  • Also, load_model doesn't work either; it gives me the error `WARNING:tensorflow:Error in loading the saved optimizer state. As a result, your model is starting with a freshly initialized optimizer.` – Rodrigo Ruiz Apr 10 '20 at 20:41
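
Regarding the two-checkpoint suggestion in the comment above, a minimal sketch (file names illustrative, reusing the variables from the toy example) could look like this:

# One callback keeps the best model seen so far, the other always keeps the latest epoch.
best_ckpt = ModelCheckpoint('best.h5', monitor='loss', save_best_only=True, mode='min', verbose=1)
last_ckpt = ModelCheckpoint('last.h5', save_best_only=False, verbose=0)
model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=[best_ckpt, last_ckpt])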

I compared my code with this example http://machinelearningmastery.com/text-generation-lstm-recurrent-neural-networks-python-keras/ by carefully commenting out code line by line and running it again. After a whole day, I finally found what was wrong.

When making the char-to-int mapping, I used:

# title_str_reduced is a string
chars = list(set(title_str_reduced))
# make char to int index mapping
char2int = {}
for i in range(len(chars)):
    char2int[chars[i]] = i    

A set is an unordered data structure. In Python, when a set is converted to an (ordered) list, the resulting order is arbitrary and can change between Python sessions. Thus my char2int dictionary was randomized every time I reopened Python. I fixed my code by adding sorted():

chars = sorted(list(set(title_str_reduced)))

This forces the conversion to a fixed order.
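
As a complementary safeguard (a sketch only; the file name char2int.json is illustrative), the mapping itself can be persisted so that every later session reuses exactly the same indices:

import json
import os

mapping_path = 'char2int.json'
if os.path.exists(mapping_path):
    # Reuse the mapping created in the first training session.
    with open(mapping_path) as f:
        char2int = json.load(f)
else:
    chars = sorted(set(title_str_reduced))
    char2int = {c: i for i, c in enumerate(chars)}
    with open(mapping_path, 'w') as f:
        json.dump(char2int, f)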

– David
  • Thank you for this. I had exactly the same trouble. It always started from the beginning after each restart, but unbelievably not during the same session, even after .save and .load. I didn't figure it out by myself, but after a few lost days I found your answer, and it saved me! Thanks! – devplayer Feb 03 '20 at 10:53

The answer above uses TensorFlow 1.x. Here is an updated version using TensorFlow 2.x.

import numpy as np
from numpy.testing import assert_allclose
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

# load the model
new_model = load_model("model.h5")
assert_allclose(model.predict(x_train),
                new_model.predict(x_train),
                1e-5)

# fit the model
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

– Mrinal Jain
  • Well, this code gives an error on my model because `model.predict(x_train)` is not equal to `new_model.predict(x_train)` (and the same the other way around). I do in fact have a different model setup using simple Conv2D, Flatten, Dense and MaxPooling2D layers, so if that's the issue, what would I need to do instead? – Rovetown Nov 04 '21 at 19:34

The check-marked answer is not correct; the real problem is more subtle.

When you create a ModelCheckpoint(), check its best attribute:

cp1 = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
print(cp1.best)

You will see that it is set to np.inf, which unfortunately is not your last best value from when you stopped training. So when you retrain and recreate the ModelCheckpoint(), if you call fit and the loss happens to be less than the previously known value, it seems to work, but in more complex problems you will end up saving a bad model and losing the best one.

You can fix this by overwriting the cp1.best attribute as shown below:

import numpy as np
from numpy.testing import assert_allclose
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint

vec_size = 100
n_units = 10

x_train = np.random.rand(500, 10, vec_size)
y_train = np.random.rand(500, vec_size)

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# define the checkpoint
filepath = "model.h5"
cp1 = ModelCheckpoint(filepath=filepath, monitor='loss', save_best_only=True, verbose=1, mode='min')
callbacks_list = [cp1]

# fit the model
model.fit(x_train, y_train, epochs=5, batch_size=50, shuffle=True, validation_split=0.1, callbacks=callbacks_list)

# load the model
new_model = load_model(filepath)
#assert_allclose(model.predict(x_train),new_model.predict(x_train), 1e-5)
score = model.evaluate(x_train, y_train, batch_size=50)
cp1 = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
cp1.best = score  # <== THIS IS THE KEY; see the source for ModelCheckpoint

# fit the model
callbacks_list = [cp1]
new_model.fit(x_train, y_train, epochs=5, batch_size=50, callbacks=callbacks_list)

– user30012

I think you can write

model.save('partly_trained.h5' )

and

model = load_model('partly_trained.h5')

instead of

model = Sequential()
model.add(LSTM(n_units, input_shape=(None, vec_size), return_sequences=True))    
model.add(Dropout(0.2)) 
model.add(LSTM(n_units, return_sequences=True))  
model.add(Dropout(0.2)) 
model.add(LSTM(n_units))
model.add(Dropout(0.2))
model.add(Dense(vec_size, activation='linear')) 
model.compile(loss='mean_squared_error', optimizer='adam')

Then continue training, because model.save() stores both the architecture and the weights, as you can read in the documentation.

– bruce
  • This did work for me. It was a little deceiving in that it started back at epoch 1; however, its initial accuracies and losses were consistent with where it had left off in training (from the last checkpoint). So if this matters, you may want to reduce the number of epochs to reflect this - I haven't been able to find a way to specify "start at epoch X" - but I think this is largely cosmetic. – Brad Oct 12 '21 at 15:07
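
Regarding the comment above: fit() accepts an initial_epoch argument, so a minimal sketch of resuming with the epoch counter carried on (assuming x and y are the same training arrays as in the question) could look like this:

from keras.models import load_model

model = load_model('partly_trained.h5')
# Train 10 more epochs, reported as epochs 41..50 rather than 1..10.
model.fit(x, y, epochs=50, batch_size=50, initial_epoch=40)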

Here is the official Keras documentation on saving a model:

https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model

In this post, the author provides two examples of saving and loading your model to a file, as:

  • JSON format.
  • YAML format.
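
For completeness, a minimal sketch of the JSON route (file names are illustrative, and `model` is assumed to be an already compiled model such as the one in the question). Note that the JSON only stores the architecture, so the weights must be saved and loaded separately, and the optimizer state is not preserved:

from keras.models import model_from_json

# Save architecture and weights separately.
with open('model.json', 'w') as f:
    f.write(model.to_json())
model.save_weights('model_weights.h5')

# Later: rebuild the architecture, reload the weights, and re-compile.
with open('model.json') as f:
    restored = model_from_json(f.read())
restored.load_weights('model_weights.h5')
restored.compile(loss='mean_squared_error', optimizer='adam')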
– a11apurva
  • A link to a solution is welcome, but please ensure your answer is useful without it: [add context around the link](http://meta.stackexchange.com/a/8259) so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. [Answers that are little more than a link may be deleted](http://stackoverflow.com/help/deleted-answers). – mrun Apr 27 '18 at 06:22
  • Thank you for the suggestions, I will keep this in mind. – a11apurva Apr 27 '18 at 09:24

Assume you have code like this:

model = some_model_you_made(input_img) # you compiled your model in this 
model.summary()

model_checkpoint = ModelCheckpoint('yours.h5', monitor='val_loss', verbose=1, save_best_only=True)

model_json = model.to_json()
with open("yours.json", "w") as json_file:
    json_file.write(model_json)

model.fit_generator(#stuff...) # or model.fit(#stuff...)

Now turn your code into this:

model = some_model_you_made(input_img) #same model here
model.summary()

model_checkpoint = ModelCheckpoint('yours.h5', monitor='val_loss', verbose=1, save_best_only=True) # same checkpoint

model_json = model.to_json()
with open("yours.json", "w") as json_file:
    json_file.write(model_json)

with open('yours.json', 'r') as f:
    old_model = model_from_json(f.read()) # rebuild the model you just saved (same as your last training run) under a different name

old_model.load_weights('yours.h5') # the model checkpoint you trained before
old_model.compile(#stuff...) # need to compile again (exactly like the last compile)

# now start training with the checkpoint...
old_model.fit_generator(#same stuff as the last train) # or old_model.fit(#stuff...)

– MeiH

Since Keras and TensorFlow are now bundled, you can use the newer TensorFlow format, which saves all model information including the optimizer and its state (from the documentation, emphasis mine):

You can save an entire model to a single artifact. It will include:

  • The model's architecture/config
  • The model's weight values (which were learned during training)
  • The model's compilation information (if compile() was called)
  • The optimizer and its state, if any (this enables you to restart training where you left)


So once your model is saved that way, you can load it and resume training: it will continue where it left off.
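
A minimal sketch of that workflow (the path is illustrative; `model`, `x`, and `y` are assumed to be the compiled model and training data from the question):

import tensorflow as tf

# Saves architecture, weights, compile information, and optimizer state.
model.save('full_model')  # TensorFlow SavedModel format (a directory); 'full_model.h5' would use HDF5 instead

# In a later Python session:
restored = tf.keras.models.load_model('full_model')
# No re-compile needed; fit() resumes with the restored optimizer state.
restored.fit(x, y, epochs=10, batch_size=50)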

– Matthieu