TensorFlow on Jupyter: Can't restore variables

Question

I can't restore saved variables when using TensorFlow in a Jupyter notebook. I train an ANN and run saver.save(sess, "params1.ckpt"), then train it again and save the new result with saver.save(sess, "params2.ckpt"). But when I run saver.restore(sess, "params1.ckpt"), my model doesn't load the values saved in params1.ckpt and keeps the ones from params2.ckpt.

If I run the model, save it to params.ckpt, close and halt the notebook, and then try to load it again, I get the following error:

```
---------------------------------------------------------------------------
StatusNotOK                               Traceback (most recent call last)
StatusNotOK: Not found: Tensor name "Variable/Adam" not found in checkpoint files params.ckpt
     [[Node: save/restore_slice_1 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/restore_slice_1/tensor_name, save/restore_slice_1/shape_and_slice)]]

During handling of the above exception, another exception occurred:

SystemError                               Traceback (most recent call last)
<ipython-input-6-39ae6b7641bd> in <module>()
----> 1 saver.restore(sess, "params.ckpt")

/usr/local/lib/python3.5/site-packages/tensorflow/python/training/saver.py in restore(self, sess, save_path)
    889       save_path: Path where parameters were previously saved.
    890     """
--> 891     sess.run([self._restore_op_name], {self._filename_tensor_name: save_path})
    892 
    893 

/usr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict)
    366 
    367     # Run request and get response.
--> 368     results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
    369 
    370     # User may have fetched the same tensor multiple times, but we

/usr/local/lib/python3.5/site-packages/tensorflow/python/client/session.py in _do_run(self, target_list, fetch_list, feed_dict)
    426 
    427       return tf_session.TF_Run(self._session, feed_dict, fetch_list,
--> 428                                target_list)
    429 
    430     except tf_session.StatusNotOK as e:

SystemError: <built-in function delete_Status> returned a result with an error set
```

My code for training is:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

def weight_variable(shape, name):
    # Note: name is passed to the initializer op, not to tf.Variable, so the
    # variable itself gets the default name "Variable" (then "Variable_1", ...).
    initial = tf.truncated_normal(shape, stddev=1.0, name=name)
    return tf.Variable(initial)

def bias_variable(shape, name):
    initial = tf.constant(1.0, shape=shape)
    return tf.Variable(initial, name=name)

input_file = pd.read_csv('P2R0PC0.csv')
features = [...]  # vector with 5 feature names (not shown in the question)
targets = [...]   # vector with 4 feature names (not shown in the question)
x_data = input_file[features].values  # same as the older input_file.as_matrix(features)
t_data = input_file[targets].values

# data/labels are used below but never defined in the question;
# assumed here to be the arrays loaded above.
data, labels = x_data, t_data

x = tf.placeholder(tf.float32, [None, x_data.shape[1]])

hiddenDim = 5

b1 = bias_variable([hiddenDim], name="b1")
W1 = weight_variable([x_data.shape[1], hiddenDim], name="W1")

b2 = bias_variable([t_data.shape[1]], name="b2")
W2 = weight_variable([hiddenDim, t_data.shape[1]], name="W2")

# Two-layer network with sigmoid activations.
hidden = tf.nn.sigmoid(tf.matmul(x, W1) + b1)
y = tf.nn.sigmoid(tf.matmul(hidden, W2) + b2)
t = tf.placeholder(tf.float32, [None, t_data.shape[1]])

lambda1 = 1
beta1 = 1
lambda2 = 1
beta2 = 1
# Cross-entropy error (clipped to avoid log(0)) plus L2 regularization.
error = -tf.reduce_sum(t * tf.log(tf.clip_by_value(y, 1e-10, 1.0))
                       + (1 - t) * tf.log(tf.clip_by_value(1 - y, 1e-10, 1.0)))
complexity = (lambda1 * tf.nn.l2_loss(W1) + beta1 * tf.nn.l2_loss(b1)
              + lambda2 * tf.nn.l2_loss(W2) + beta2 * tf.nn.l2_loss(b2))
loss = error + complexity

train_step = tf.train.AdamOptimizer(0.001).minimize(loss)
sess = tf.Session()

init = tf.initialize_all_variables()
sess.run(init)

ran = 25001
delta = 250

plot_data = np.zeros(int(ran / delta + 1))
k = 0
for i in range(ran):
    train_step.run({x: data, t: labels}, sess)
    if i % delta == 0:
        plot_data[k] = loss.eval({x: data, t: labels}, sess)
        #plot_training[k] = loss.eval({x: x_test, t: t_test}, sess)
        print(str(plot_data[k]))
        k = k + 1

plt.plot(np.arange(start=2, stop=int(ran / delta + 1)), plot_data[2:])

saver = tf.train.Saver()
saver.save(sess, "params.ckpt")

error.eval({x: data, t: labels}, session=sess)
```

Am I doing anything wrong? Why can't I ever restore my variables?

Do you build multiple copies of the same graph in the same process? If so, it's possible that the names of the tensors in the different checkpoints are different, which causes a mismatch when you try to restore them. — mrry, Jan 11 '16 at 20:53
What do you mean by building multiple copies? Every time I go to bed and come back to my computer I have to run the entire code again from scratch, so I do rebuild the graph, but shouldn't the names still match? Although I just realised I wrote name=name in a different place than I intended in the weight_variable function, so I'll see if that's the problem... — Pedro Carvalho, Jan 13 '16 at 11:12
I mean, do you execute the training code multiple times in the same process (i.e. from some outer function not shown)? — mrry, Jan 13 '16 at 16:15
Ah, now I see that you're using Jupyter, which suggests a [possible answer](http://stackoverflow.com/a/34771937/3574081). — mrry, Jan 13 '16 at 16:29
I am having exactly the same issue. Came across this posting, but no success yet: http://stackoverflow.com/questions/33759623/tensorflow-how-to-restore-a-previously-saved-model-python — sray, Jan 12 '16 at 21:14

score 8 · Accepted Answer · edited May 23 '17 at 11:54

It looks like you are using Jupyter to build your model. One possible issue is that a tf.train.Saver constructed with the default arguments uses the (auto-generated) names of the variables as the keys in your checkpoint. Since it's easy to re-execute code cells multiple times in Jupyter, you might end up with multiple copies of the variable nodes in the session that you save. See my answer to this question for an explanation of what can go wrong.
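To make the naming problem concrete, here is a toy, pure-Python model of it. It is not TensorFlow code: `ToyGraph`, `model_cell`, `default_save` and `restore` are hypothetical names. It assumes the observable TF 1.x behaviour that a repeated variable name gets a `_1`, `_2`, ... suffix, and that a default `tf.train.Saver()` saves every variable under its graph name. In the question's code W1 and W2 are unnamed, so they become `Variable` and `Variable_1`, which is consistent with the `Variable/Adam` name in the error.

```python
from collections import Counter


class ToyGraph:
    """Toy stand-in for a TF 1.x graph that only tracks variable names and values.

    Assumption: it mimics the observable rule that a repeated name gets a
    "_1", "_2", ... suffix; it is not TensorFlow's implementation.
    """

    def __init__(self):
        self._name_counts = Counter()
        self.values = {}  # unique variable name -> current value

    def variable(self, value, name="Variable"):
        count = self._name_counts[name]
        self._name_counts[name] += 1
        unique = name if count == 0 else f"{name}_{count}"
        self.values[unique] = value
        return unique


def model_cell(graph):
    """Hypothetical stand-in for the notebook cell that builds the model."""
    w1 = graph.variable(0.5)             # unnamed, like W1 in the question
    b1 = graph.variable(1.0, name="b1")  # explicitly named
    return w1, b1


def default_save(graph):
    # Like a default tf.train.Saver(): every variable, keyed by its graph name.
    return dict(graph.values)


def restore(graph, checkpoint):
    # Look up each variable of the current graph by name in the checkpoint.
    for name in list(graph.values):
        if name not in checkpoint:
            raise KeyError(f'Tensor name "{name}" not found in checkpoint')
        graph.values[name] = checkpoint[name]


# Kernel 1: the model cell is executed twice, then the live copy is trained.
notebook = ToyGraph()
model_cell(notebook)                # creates "Variable" and "b1"
w1, b1 = model_cell(notebook)       # creates "Variable_1" and "b1_1"
notebook.values[w1] = 0.123         # pretend training updated only these
notebook.values[b1] = 0.456
ckpt = default_save(notebook)
print(sorted(ckpt))                 # ['Variable', 'Variable_1', 'b1', 'b1_1']

# Kernel 2 (fresh): the cell runs once, so only "Variable" and "b1" exist.
fresh = ToyGraph()
model_cell(fresh)
restore(fresh, ckpt)
print(fresh.values)                 # {'Variable': 0.5, 'b1': 1.0}: the stale copy

# Reverse case: a clean checkpoint restored into the kernel with duplicates.
try:
    restore(notebook, default_save(fresh))
except KeyError as err:
    missing = err.args[0]
print(missing)                      # Tensor name "Variable_1" not found in checkpoint
```

In this toy run the trained values live under `Variable_1` and `b1_1`, but a fresh kernel only has `Variable` and `b1`, so restoring by name silently loads the untrained first copy; going the other way fails with a not-found error similar to the one in the question.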

There are a few possible solutions. Here are the easiest:

1. Call `tf.reset_default_graph()` before you build your model (and the Saver). This will ensure that the variables get the names you intended, but it will invalidate previously-created graphs.
2. Use explicit arguments to `tf.train.Saver()` to specify the persistent names for the variables. For your example this shouldn't be too hard (though it becomes unwieldy for larger models):
```
saver = tf.train.Saver(var_list={"b1": b1, "W1": W1, "b2": b2, "W2": W2})
```
3. Create a new `tf.Graph()` and make it the default each time you create the model. This can be tricky in Jupyter, since it forces you to put all of the model-building code in one cell, but it works well for scripts:
```
with tf.Graph().as_default():
    # Model building and training/evaluation code goes here.
    pass  # placeholder so the block is valid Python
```
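Using the same toy model (redefined so the block runs on its own, and still not TensorFlow code; all names are hypothetical), this sketch shows why these options help. Starting from an empty graph each time the model is built keeps the names identical in every kernel, while an explicit key mapping, the analogue of `var_list={...}`, makes the checkpoint keys independent of whatever names the graph assigned. Rebinding `g` to a new `ToyGraph()` stands in for `tf.reset_default_graph()` or `tf.Graph().as_default()`; it does not model the invalidation of old graphs and sessions.

```python
from collections import Counter


class ToyGraph:
    """Toy name-tracking graph: repeated names get "_1", "_2", ... suffixes.

    An illustration only, not TensorFlow's implementation.
    """

    def __init__(self):
        self._name_counts = Counter()
        self.values = {}  # unique variable name -> current value

    def variable(self, value, name="Variable"):
        count = self._name_counts[name]
        self._name_counts[name] += 1
        unique = name if count == 0 else f"{name}_{count}"
        self.values[unique] = value
        return unique


def model_cell(graph):
    # Hypothetical model-building cell; W1 is left unnamed as in the question.
    return graph.variable(0.5), graph.variable(1.0, name="b1")


def save(graph, var_list):
    # var_list maps checkpoint key -> graph name, like Saver(var_list={...}).
    return {key: graph.values[name] for key, name in var_list.items()}


def restore(graph, var_list, checkpoint):
    for key, name in var_list.items():
        graph.values[name] = checkpoint[key]


# Options 1 and 3: start from an empty graph every time the cell runs.
g = ToyGraph()
model_cell(g)                   # first execution
g = ToyGraph()                  # analogous to tf.reset_default_graph()
w1, b1 = model_cell(g)          # re-execution: still "Variable" and "b1"
g.values[w1], g.values[b1] = 0.123, 0.456        # pretend training
ckpt1 = save(g, {name: name for name in g.values})  # keys = graph names

fresh = ToyGraph()
names = model_cell(fresh)
restore(fresh, {n: n for n in names}, ckpt1)
print(fresh.values)             # {'Variable': 0.123, 'b1': 0.456}

# Option 2: explicit keys survive re-executions even though graph names drift.
g2 = ToyGraph()
model_cell(g2)
w1, b1 = model_cell(g2)         # graph names are now "Variable_1" and "b1_1"
g2.values[w1], g2.values[b1] = 0.123, 0.456
ckpt2 = save(g2, {"W1": w1, "b1": b1})

fresh2 = ToyGraph()
w1_f, b1_f = model_cell(fresh2)  # "Variable" and "b1"
restore(fresh2, {"W1": w1_f, "b1": b1_f}, ckpt2)
print(sorted(ckpt2), fresh2.values)  # ['W1', 'b1'] {'Variable': 0.123, 'b1': 0.456}
```

In both cases the fresh kernel ends up with the trained values, which is what the name-keyed default save in the question failed to guarantee.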

There doesn't seem to be any function called "reset_default_graph()", and I'd rather not use the explicit arguments for the reasons you explained. — Pedro Carvalho, Jan 14 '16 at 15:20
Does the third option work for you, and if not can you try the second option to see if it fixes your problem? It looks like you'll have to [install from source](https://www.tensorflow.org/versions/master/get_started/os_setup.html#installing-from-sources) to get `tf.reset_default_graph()`. — mrry, Jan 14 '16 at 16:14
