I am using AI Platform to train a TensorFlow model with the Estimator API. However, when the model saves a checkpoint and then attempts to restore it, it throws the error: tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for gs://path/keras/keras_model.ckpt

This appears to be an issue with restoring the metadata graph in TensorFlow, which is handled by code in the session setup (see "TensorFlow, why there are 3 files after saving the model?"). However, since AI Platform abstracts the session setup away from my configuration, how can I fix this?

Austin Guo

1 Answer

Never mind. It turns out that the job directory from the previous run is not deleted at the start of each new job run (on purpose, so that multiple workers can train at the same time). If the previous run failed, some of its checkpoints were not stored properly, and those leftover checkpoints cause problems for AI Platform.
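
One way to avoid picking up stale checkpoints is to give each job run its own directory, so a failed run cannot leave broken files behind for the next one. This is a minimal sketch, not anything AI Platform provides: the helper name unique_job_dir, the bucket path, and the job name are all hypothetical. The workers of a single job would still share that one per-run directory.

```python
from datetime import datetime, timezone
from typing import Optional


def unique_job_dir(base_dir: str, job_name: str,
                   now: Optional[datetime] = None) -> str:
    """Build a per-run job directory path under base_dir.

    Hypothetical helper: base_dir is assumed to be a bucket path such as
    gs://my-bucket/jobs, and job_name is any label for the run.
    """
    # Use the current UTC time unless a fixed time is passed in (e.g. for tests).
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    # Strip a trailing slash so the joined path has exactly one separator.
    return f"{base_dir.rstrip('/')}/{job_name}_{stamp}"


if __name__ == "__main__":
    # Placeholder bucket and job name, for illustration only.
    print(unique_job_dir("gs://my-bucket/jobs", "estimator_train"))
```

The resulting path could then be passed as the job directory when submitting the job, for example through the --job-dir flag of gcloud ai-platform jobs submit training. Alternatively, the old job directory can be deleted manually before resubmitting after a failed run.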

Austin Guo