Python TensorFlow: How to restart training with optimizer and import_meta_graph?

Question

I'm trying to restart model training in TensorFlow, picking up where it left off. I'd like to use the recently added (0.12+, I think) import_meta_graph() so that I don't have to reconstruct the graph.

I've seen solutions for this, e.g. "Tensorflow: How to save/restore a model?", but I run into issues with AdamOptimizer. Specifically, I get the error ValueError: cannot add op with name <my weights variable name>/Adam as that name is already used. This can be fixed by initializing the variables, but then my model values are cleared!

There are other answers and some full examples out there, but they always seem older and so don't include the newer import_meta_graph() approach, or don't have a non-tensor optimizer. The closest question I could find is "tensorflow: saving and restoring session", but there is no clear-cut final solution and the example is pretty complicated.

Ideally I'd like a simple runnable example that starts from scratch, stops, then picks up again. I have something that works (below), but I do wonder if I'm missing something. Surely I'm not the only one doing this?

I had the same issue with AdamOptimizer. I managed to get things to work by putting my ops in collections. This example helped me a lot: http://www.seaandsailor.com/tensorflow-checkpointing.html — Anjum Sayed, Apr 14 '17 at 13:44

score 8 · Accepted Answer · answered Apr 06 '17 at 00:12

Here is what I came up with from reading the docs, other similar solutions, and trial and error. It's a simple autoencoder on random data. If you run it and then run it again, the second run continues from where the first left off (i.e. the cost goes from ~0.5 to ~0.3 on the first run, and the second run starts at ~0.3). Unless I missed something, all of the saving, constructors, model building, and add_to_collection calls are needed, and in this precise order, but there may be a simpler way.

And yes, loading the graph with import_meta_graph isn't really needed here since the graph-building code is right there, but it is what I want in my actual application.

from __future__ import print_function
import tensorflow as tf
import os
import math
import numpy as np

output_dir = "/root/Data/temp"
model_checkpoint_file_base = os.path.join(output_dir, "model.ckpt")

input_length = 10
encoded_length = 3
learning_rate = 0.001
n_epochs = 10
n_batches = 10
if not os.path.exists(model_checkpoint_file_base + ".meta"):
    print("Making new")
    brand_new = True

    x_in = tf.placeholder(tf.float32, [None, input_length], name="x_in")
    W_enc = tf.Variable(tf.random_uniform([input_length, encoded_length],
                                          -1.0 / math.sqrt(input_length),
                                          1.0 / math.sqrt(input_length)), name="W_enc")
    b_enc = tf.Variable(tf.zeros(encoded_length), name="b_enc")
    encoded = tf.nn.tanh(tf.matmul(x_in, W_enc) + b_enc, name="encoded")
    W_dec = tf.transpose(W_enc, name="W_dec")
    b_dec = tf.Variable(tf.zeros(input_length), name="b_dec")
    decoded = tf.nn.tanh(tf.matmul(encoded, W_dec) + b_dec, name="decoded")
    cost = tf.sqrt(tf.reduce_mean(tf.square(decoded - x_in)), name="cost")

    saver = tf.train.Saver()
else:
    print("Reloading existing")
    brand_new = False
    saver = tf.train.import_meta_graph(model_checkpoint_file_base + ".meta")
    g = tf.get_default_graph()
    x_in = g.get_tensor_by_name("x_in:0")
    cost = g.get_tensor_by_name("cost:0")


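sess = tf.Session()
if brand_new:
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    sess.run(init)
    # Store the train op in a collection so it is written to the .meta file
    # and can be looked up again after import_meta_graph().
    tf.add_to_collection("optimizer", optimizer)
    # Re-create the Saver now that Adam's variables exist, so they are
    # checkpointed along with the model weights.
    saver = tf.train.Saver()
else:
    # Restores the model weights and Adam's variables from the checkpoint.
    saver.restore(sess, model_checkpoint_file_base)
    optimizer = tf.get_collection("optimizer")[0]

for epoch_i in range(n_epochs):
    for batch_i in range(n_batches):
        batch = np.random.rand(50, input_length)
        _, curr_cost = sess.run([optimizer, cost], feed_dict={x_in: batch})
        print("batch_cost:", curr_cost)
        # Reuse one Saver; calling tf.train.Saver() here would add new
        # save/restore ops to the graph on every batch.
        save_path = saver.save(sess, model_checkpoint_file_base)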
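
The <weights variable name>/Adam names in the error belong to the extra variables AdamOptimizer creates for each weight, and they have to be restored together with the weights for training to really continue where it stopped. A toy sketch of why, in plain Python rather than TensorFlow: a scalar Adam update on a made-up objective (w - 3)^2, with pickle standing in for the checkpoint. The function names, learning rate, and objective are assumptions chosen for illustration, and the update is the textbook Adam rule, not TensorFlow's exact implementation.

```python
import os
import pickle
import tempfile


def adam_step(state, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one textbook Adam update to a scalar weight; return the new state."""
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * state["v"] + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = state["w"] - lr * m_hat / (v_hat ** 0.5 + eps)
    return {"w": w, "m": m, "v": v, "t": t}


def train(state, n_steps):
    """Minimize the toy objective (w - 3)^2 for n_steps from the given state."""
    for _ in range(n_steps):
        grad = 2.0 * (state["w"] - 3.0)
        state = adam_step(state, grad)
    return state


fresh = {"w": 0.0, "m": 0.0, "v": 0.0, "t": 0}

# Reference run: 20 uninterrupted steps.
uninterrupted = train(dict(fresh), 20)

# Interrupted run: 10 steps, checkpoint weight AND optimizer state, reload, 10 more.
halfway = train(dict(fresh), 10)
ckpt_path = os.path.join(tempfile.mkdtemp(), "toy.ckpt")
with open(ckpt_path, "wb") as f:
    pickle.dump(halfway, f)
with open(ckpt_path, "rb") as f:
    restored = pickle.load(f)
resumed = train(restored, 10)

# Weight restored, but the optimizer state re-initialized from scratch.
weights_only = train({"w": halfway["w"], "m": 0.0, "v": 0.0, "t": 0}, 10)

print("uninterrupted:", uninterrupted["w"])
print("resumed:      ", resumed["w"])
print("weights only: ", weights_only["w"])
```

Restoring the full state reproduces the uninterrupted run, while restoring only w and resetting the moments ends up at a different value. That is the plain-Python analogue of making sure the Saver covers Adam's variables and not just the model weights.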

score 2 · Answer 2 · edited Apr 08 '18 at 18:48

I had the same issue and I just figured out what was wrong, at least in my code.

It turned out I had used the wrong file name in saver.restore(). This function must be given the checkpoint name without the file extension, exactly as it was passed to saver.save():

saver.restore(sess, 'model-1')

instead of

saver.restore(sess, 'model-1.data-00000-of-00001')

With this I do exactly what you wish to do: starting from scratch, stopping, then picking up again. I don't need to initialize a second saver from a meta file using the tf.train.import_meta_graph() function, and I don't need to explicitly state tf.initialize_all_variables() after initializing the optimizer.

My complete model restore looks like this:

with tf.Session() as sess:
    saver = tf.train.Saver()
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'model-1')

I think that with the V1 checkpoint format you still had to add .ckpt to the file name, and for import_meta_graph() you still need to add .meta, which might cause some confusion among users. Maybe this should be pointed out more explicitly in the documentation.
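
To make the naming rule concrete, here is a small sketch of a helper that turns any of the files written next to a checkpoint into the prefix saver.restore() expects. The name checkpoint_prefix is hypothetical (it is not part of TensorFlow), and the suffix list assumes the .meta, .index, and .data-NNNNN-of-NNNNN files that the current checkpoint format writes.

```python
import re

# Suffixes appended to the prefix that was passed to saver.save().
_CKPT_SUFFIX = re.compile(r"\.(meta|index|data-\d{5}-of-\d{5})$")


def checkpoint_prefix(path):
    """Hypothetical helper: strip a checkpoint file suffix, leaving the prefix."""
    return _CKPT_SUFFIX.sub("", path)


print(checkpoint_prefix("model-1.data-00000-of-00001"))  # model-1
print(checkpoint_prefix("model-1.meta"))                 # model-1
print(checkpoint_prefix("model-1"))                      # model-1
```

The result is what goes into saver.restore(sess, ...), while import_meta_graph() still wants the full name ending in .meta.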

Kate Shin · Answer 3 · 2017-10-10T15:57:49.840

There might be a problem with how you create the saver object in the restoring session.

I got the same error as you when using the code below in the restoring session.

saver = tf.train.import_meta_graph('tmp/hsmodel.meta')
saver.restore(sess, tf.train.latest_checkpoint('tmp/'))

But when I changed it to this,

saver = tf.train.Saver()
saver.restore(sess, "tmp/hsmodel")

the error went away. "tmp/hsmodel" is the path I passed to saver.save(sess, "tmp/hsmodel") in the saving session.

A simple example of saving and restoring a session while training an MNIST network (with the Adam optimizer) is linked below. It was helpful to compare it with my code to find and fix the problem.

https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/4_Utils/save_restore_model.py

score 0 · Answer 4 · answered Aug 03 '18 at 05:01

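The Saver class allows us to save a session via: saver.save(sess, "checkpoints.ckpt")

And it allows us to restore the session: saver.restore(sess, tf.train.latest_checkpoint(".")). Note that latest_checkpoint() takes the directory containing the checkpoint (here the current directory), not the checkpoint name itself.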

ravishankar.rj
