
I've been training a TensorFlow model for about a week, with occasional fine-tuning sessions.

Today, when I tried to fine-tune the model, I got the error:

tensorflow.python.framework.errors_impl.NotFoundError: Key conv_classifier/loss/total_loss/avg not found in checkpoint
 [[Node: save/RestoreV2_37 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_37/tensor_names, save/RestoreV2_37/shape_and_slices)]]

Using inspect_checkpoint.py, I see that the checkpoint file now contains two scalar (shape []) entries:

...
conv_decode4/ort_weights/Momentum (DT_FLOAT) [7,7,64,64]
loss/cross_entropy/avg (DT_FLOAT) []
loss/total_loss/avg (DT_FLOAT) []
up1/up_filter (DT_FLOAT) [2,2,64,64]
...

How do I fix this problem?

SOLUTION:

Following mrry's suggestion below, edited for clarity:

# Map every variable's in-graph name to the variable itself.
code_to_checkpoint_variable_map = {var.op.name: var for var in tf.global_variables()}

# Re-key the renamed variables under the names used in the checkpoint.
for code_variable_name, checkpoint_variable_name in {
    "inference/conv_classifier/weight_loss/avg": "loss/weight_loss/avg",
    "inference/conv_classifier/loss/total_loss/avg": "loss/total_loss/avg",
    "inference/conv_classifier/loss/cross_entropy/avg": "loss/cross_entropy/avg",
}.items():
    code_to_checkpoint_variable_map[checkpoint_variable_name] = \
        code_to_checkpoint_variable_map.pop(code_variable_name)

saver = tf.train.Saver(code_to_checkpoint_variable_map)
saver.restore(sess, tf.train.latest_checkpoint('./logs'))
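The remapping itself is ordinary dictionary surgery. A minimal standalone sketch of the same logic, using plain strings in place of tf.Variable objects so it runs without TensorFlow (the variable placeholders here are made up for illustration):

```python
# Stand-in for {var.op.name: var for var in tf.global_variables()}:
# in-graph variable names mapped to (here, fake) variable objects.
code_map = {
    "inference/conv_classifier/loss/total_loss/avg": "<Variable total_loss>",
    "inference/conv_classifier/loss/cross_entropy/avg": "<Variable cross_entropy>",
    "up1/up_filter": "<Variable up_filter>",
}

# Names that differ between the code and the checkpoint.
renames = {
    "inference/conv_classifier/loss/total_loss/avg": "loss/total_loss/avg",
    "inference/conv_classifier/loss/cross_entropy/avg": "loss/cross_entropy/avg",
}

# Re-key each renamed entry under its checkpoint name;
# untouched entries keep their in-graph names.
for code_name, ckpt_name in renames.items():
    code_map[ckpt_name] = code_map.pop(code_name)

print(sorted(code_map))
```

After the loop, the dictionary is keyed exactly the way the checkpoint expects, which is what tf.train.Saver needs to match checkpoint keys to variables.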
empty
  • Is the key `"loss/weight_loss/avg"` in the checkpoint file? The first exception message suggests that it is not. (I don't understand the other changes you've made, but the full code block looks sensible.) – mrry Oct 13 '17 at 20:28

1 Answer


Fortunately, it doesn't look like your checkpoint is corrupt, but rather some of the variables in your program have been renamed. I'm assuming that the checkpoint value named "loss/total_loss/avg" should be restored to a variable named "conv_classifier/loss/total_loss/avg". You can solve this by passing a custom var_list when you create your tf.train.Saver.

name_to_var_map = {var.op.name: var for var in tf.global_variables()}

name_to_var_map["loss/total_loss/avg"] = name_to_var_map[
    "conv_classifier/loss/total_loss/avg"]
del name_to_var_map["conv_classifier/loss/total_loss/avg"]

# Depending on how the names have changed, you may also need to do:
# name_to_var_map["loss/cross_entropy/avg"] = name_to_var_map[
#     "conv_classifier/loss/cross_entropy/avg"]
# del name_to_var_map["conv_classifier/loss/cross_entropy/avg"]

saver = tf.train.Saver(name_to_var_map)

You can then use saver.restore() to restore your model. Alternatively, you can use this approach to restore the model and a default-constructed tf.train.Saver to save it in the canonical format.
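That "restore under the checkpoint's names, then re-save under the code's names" migration can be sketched generically. Here a checkpoint is simulated as a plain name-to-value dict, and all names and values are hypothetical:

```python
def migrate_checkpoint(old_ckpt, renames):
    """Return a new checkpoint dict with selected keys renamed.

    old_ckpt: {checkpoint_name: value}
    renames:  {old_name: new_name}
    Keys not listed in `renames` are copied unchanged.
    """
    return {renames.get(name, name): value for name, value in old_ckpt.items()}

# Hypothetical old checkpoint contents.
old_ckpt = {
    "loss/total_loss/avg": 0.37,
    "loss/cross_entropy/avg": 0.21,
    "up1/up_filter": [2, 2, 64, 64],
}

# Re-key the loss averages under the names the code now uses, so a
# default-constructed saver would find everything where it expects it.
new_ckpt = migrate_checkpoint(old_ckpt, {
    "loss/total_loss/avg": "conv_classifier/loss/total_loss/avg",
    "loss/cross_entropy/avg": "conv_classifier/loss/cross_entropy/avg",
})

print(sorted(new_ckpt))
```

In real TensorFlow code the two steps are a restore with the custom var_list saver followed by a save with a default tf.train.Saver; the sketch only shows the renaming that happens in between.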

mrry
  • I'm a little confused: `name_to_var_map["loss/total_loss/avg"]` doesn't need to be (and in fact shouldn't be) present because we're assigning to it. Just a sanity check: is `name_to_var_map["conv_classifier/loss/total_loss/avg"]` present? – mrry Oct 13 '17 at 00:53
  • That's what I was expecting, yes! I tried mutating a dictionary comprehension with Python 2.7 and 3.4 and it seemed to work. Perhaps adding `name_to_var_map = dict(name_to_var_map)` will make it mutable? – mrry Oct 13 '17 at 01:10
  • Ahh, sorry about that: the original dictionary comprehension should have used `var.op.name` as the key, and not `var.name`. (The `":0"` in the `Variable.name` property is a [historical artifact](https://stackoverflow.com/a/36156697/3574081).) Does that fix things? – mrry Oct 13 '17 at 15:07
  • Sad to say stuff is still broken. I've moved additional points to the OP below "EDIT:" because SO is giving me warnings about the comment chain. I've deleted previous comments because of this. – empty Oct 13 '17 at 18:50
  • Got it. Full answer in OP. Thanks for your help. – empty Oct 13 '17 at 21:38