
Taking the code in this repository (https://github.com/tensorflow/models/tree/master/official/resnet) as an example, I adapted it to run on another dataset with a large number of classes.

Everything seems to work fine and convergence is good, except that each time a checkpoint is restored, the loss (and the training accuracy) suddenly spikes. After some time, it returns to its previous minimum and keeps going down.

Is something not being restored correctly? In other words, could it be related to the fact that the checkpoint file contains nothing about the optimizer state (such as the gradients from the previous step)?
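One way to check this is to list what the checkpoint actually contains. A minimal sketch (TensorFlow 1.x), assuming the checkpoints live in a placeholder directory "model_dir": if the optimizer state were saved, slot variables (e.g. names ending in "/Momentum") would appear next to the model weights.

    import tensorflow as tf

    # Path of the most recent checkpoint in the (hypothetical) model directory.
    ckpt_path = tf.train.latest_checkpoint("model_dir")

    # Print every variable stored in the checkpoint with its shape.
    for name, shape in tf.train.list_variables(ckpt_path):
        print(name, shape)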

  • The problem might be more subtle: a potential misuse of the function one_hot. I have integer labels that are not sequential, e.g. labels = [0, 10, 20, 40, 50], and I was using it like onehot_labels = tf.one_hot(labels, 5). I just realized that the labels 10, 20, 40 and 50 all end up identical after one_hot, because they fall outside the depth of 5 and are encoded as all-zero vectors. I was assuming that internally a dictionary handled this correctly, but apparently the labels need to be sequential. I just launched a new experiment in which I build that dictionary myself (see the sketch after these comments). – Jerome Maye Oct 04 '17 at 14:28
  • Still not working, same behavior. It's kind of annoying; I'll have to dig into the TensorFlow source code to see whether there is an issue. I suspect the batch normalization is at fault. – Jerome Maye Oct 06 '17 at 08:07
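A minimal sketch of the remapping described in the first comment; the label values come from the comment, while the variable names are illustrative. Each raw label is mapped to a dense index in [0, num_classes) before calling tf.one_hot, so every class gets a distinct one-hot vector.

    import tensorflow as tf

    raw_labels = [0, 10, 20, 40, 50]  # non-sequential class ids

    # Dictionary from raw label value to a dense index 0..num_classes-1.
    label_to_index = {v: i for i, v in enumerate(sorted(set(raw_labels)))}

    dense_labels = [label_to_index[v] for v in raw_labels]
    onehot_labels = tf.one_hot(dense_labels, depth=len(label_to_index))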

1 Answer


I solved my problem by replacing saver = tf.train.import_meta_graph(checkpoint + '.meta') with saver = tf.train.Saver() during the restore process. See this post: https://stackoverflow.com/a/41287885/6922356
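A minimal sketch of that change (TensorFlow 1.x), assuming the model graph is rebuilt in code exactly as during training and the checkpoints sit in a placeholder directory "model_dir":

    import tensorflow as tf

    # ... rebuild the model graph here exactly as during training ...

    checkpoint = tf.train.latest_checkpoint("model_dir")

    # Instead of importing the saved meta graph:
    #   saver = tf.train.import_meta_graph(checkpoint + '.meta')
    # create a Saver against the freshly built graph and restore into it.
    saver = tf.train.Saver()

    with tf.Session() as sess:
        saver.restore(sess, checkpoint)
        # ... continue training or run evaluation ...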