Resume training with multi_gpu_model in Keras

Question

I'm training a modified InceptionV3 model with the multi_gpu_model in Keras, and I use model.save to save the whole model.

Then I closed and restarted the IDE and used load_model to reinstantiate the model.

The problem is that I am not able to resume the training exactly where I left off.

Here is the code:

parallel_model = multi_gpu_model(model, gpus=2)

parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

history = parallel_model.fit_generator(generate_batches(path), steps_per_epoch = num_images/batch_size, epochs = num_epochs)

model.save('my_model.h5')

Before I closed the IDE, the loss was around 0.8.

After restarting the IDE, reloading the model and re-running the above code, the loss became 1.5.

But according to the Keras FAQ, model.save should save the whole model (architecture + weights + optimizer state), and load_model should return a compiled model that is identical to the previous one.

So I don't understand why the loss becomes larger after resuming the training.

EDIT: If I don't use the multi_gpu_model and just use the ordinary model, I'm able to resume exactly where I left off.

I'm facing the same issue. Were you able to find a solution to this? – simha, Apr 16 '18 at 19:18

score 1 · Answer 1 · answered Jul 17 '18 at 13:26

When you call multi_gpu_model(...), Keras automatically sets the weights of your model to some default values (at least in version 2.2.0, which I am currently using). That is why training did not resume from the point at which you saved the model.

I solved the issue by replacing the weights of the parallel model with the weights from the original single-GPU model:

parallel_model = multi_gpu_model(model, gpus=2)

parallel_model.layers[-2].set_weights(model.get_weights()) # you can check the index of the original model inside the parallel model with parallel_model.summary()

parallel_model.compile(optimizer='rmsprop', loss='categorical_crossentropy')

history = parallel_model.fit_generator(generate_batches(path), steps_per_epoch = num_images/batch_size, epochs = num_epochs)

I hope this will help you.
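The hard-coded index -2 above depends on how the parallel model happens to be laid out. A minimal sketch of a name-based lookup follows. It assumes the wrapped model appears among parallel_model.layers under its own name, which is what the parallel_model.summary() hint suggests. The helper name restore_template_weights is hypothetical and not part of Keras; it only relies on the layers, name, get_weights and set_weights members used in the answer.

```python
def restore_template_weights(parallel_model, template_model):
    """Copy template_model's weights into the same-named layer of parallel_model.

    Hypothetical helper (not a Keras API). It looks the nested model up by
    name instead of relying on a fixed position such as layers[-2].
    """
    for layer in parallel_model.layers:
        if layer.name == template_model.name:
            layer.set_weights(template_model.get_weights())
            return layer
    # Fail loudly rather than silently training from the wrong weights.
    raise ValueError(
        "no layer named %r in the parallel model" % template_model.name
    )


# Intended use, assuming the setup from the answer above:
#   parallel_model = multi_gpu_model(model, gpus=2)
#   restore_template_weights(parallel_model, model)
```

If no layer with a matching name exists, the helper raises ValueError, so a layout that differs from this assumption is noticed before training starts.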

score 0 · Answer 2 · answered Jan 04 '19 at 08:03

@saul19am When you compile the model again, only the weights and the model structure are carried over; you still lose the optimizer state. I think this can help.
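To illustrate why a lost optimizer state can make the loss jump, here is a toy scalar version of the RMSprop update. It is a sketch only, not Keras's implementation: the class ToyRMSprop is hypothetical, and the defaults (lr=0.001, rho=0.9, eps=1e-7) are assumed for illustration. It compares the update size of an optimizer that has been running for a while with that of a freshly created one, given the same gradient.

```python
import math


class ToyRMSprop:
    """Scalar RMSprop for illustration only (not Keras's implementation)."""

    def __init__(self, lr=0.001, rho=0.9, eps=1e-7):
        self.lr, self.rho, self.eps = lr, rho, eps
        # Running average of squared gradients: this is the "optimizer state".
        self.accum = 0.0

    def step_size(self, grad):
        """Update the state and return the magnitude of the parameter update."""
        self.accum = self.rho * self.accum + (1.0 - self.rho) * grad * grad
        return self.lr * abs(grad) / (math.sqrt(self.accum) + self.eps)


# Optimizer that has already seen many gradients (training before the restart).
warm = ToyRMSprop()
for _ in range(200):
    warm_step = warm.step_size(1.0)

# Fresh optimizer, as created by a new compile(optimizer='rmsprop').
fresh = ToyRMSprop()
fresh_step = fresh.step_size(1.0)

ratio = fresh_step / warm_step
print("first step after reset is %.2fx larger" % ratio)
```

With a constant gradient, the first update of the fresh optimizer is roughly three times larger than the warmed-up one, because its running average starts at zero. Larger steps right after a restart are one plausible reason for the loss to rise temporarily.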

Chuong Nguyen

9
1

Your answer looks like a comment to me. Please do not answer with a comment. Understandably, your rep is too low to comment, but that still does not mean answers should be used to make comments as an alternative. – Dang Nguyen Jan 04 '19 at 08:25
