
Following the Running MNIST on Cloud TPU tutorial:

I get the following error when I try to train:

python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --DATA_DIR=${STORAGE_BUCKET}/data \
  --MODEL_DIR=${STORAGE_BUCKET}/output \
  --use_tpu=True \
  --iterations=500 \
  --train_steps=2000

=>

alexryan@alex-tpu:~/tpu$ ./train-mnist.sh 
W1025 20:21:39.351166 139745816463104 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
    from . import file_cache
  File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
    'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 152, in main
    tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations, FLAGS.num_shards),
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_config.py", line 207, in __init__
    self._master = cluster.master()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 223, in master
    job_tasks = self.cluster_spec().job_tasks(self._job_name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 269, in cluster_spec
    (compat.as_text(self._tpu), response['health']))
RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"
alexryan@alex-tpu:~/tpu$ 

The only places where I varied from the instructions were:

Instead of running ctpu in Cloud Shell, I ran it on my Mac.

>ctpu version
ctpu version: 1.7

The zone where the TPU resided was different from the default zone in my config, so I specified it as an option, like so:

>cat ctpu-up.sh 
ctpu up --zone us-central1-b --preemptible
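
For what it's worth, ctpu status can confirm that both the VM and the TPU came up in the intended zone; this is a sketch assuming ctpu's global --zone flag applies here the same way it does for ctpu up:

# check that both the Compute Engine VM and the Cloud TPU are running, in the expected zone
ctpu status --zone us-central1-b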

I was able to copy the MNIST files from the VM to the GCS bucket without a problem:

alexryan@alex-tpu:~$ gsutil cp -r ./data ${STORAGE_BUCKET}
Copying file://./data/validation.tfrecords [Content-Type=application/octet-stream]...
Copying file://./data/train-images-idx3-ubyte.gz [Content-Type=application/octet-stream]...

I tried the "(Optional) Set up TensorBoard > Running cloud_tpu_profiler" step:

Go to the Cloud Console > TPUs > and click on the TPU you created. Locate the service account name for the Cloud TPU and copy it, for example:

service-11111111118@cloud-tpu.iam.myserviceaccount.com

In the list of buckets, select the bucket you want to use, select Show Info Panel, and then select Edit bucket permissions. Paste your service account name into the add members field for that bucket and select the following permissions:

"Cloud Console > TPUs" does not exist as an option
so I used the service account associate with the VM
"Cloud Console > Compute Engine > alex-tpu"

Since the last error message was "RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"", I used ctpu to delete the VM and re-create it, then ran the training again. This time I got more errors:

This one seems like it might be just a warning ...

ImportError: file_cache is unavailable when using oauth2client >=
4.0.0 or google-auth

Not sure about this one ...

ERROR:tensorflow:Operation of type Placeholder (reshape_input) is not supported on the TPU. Execution will fail if this op is used in the graph. 

This one seemed to kill the training ...

INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')      [[{{node save/SaveV2}} = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64],
_device="/job:worker/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv2d/bias/Read/ReadVariableOp, conv2d/kernel/Read/ReadVariableOp, conv2d_1/bias/Read/ReadVariableOp, conv2d_1/kernel/Read/ReadVariableOp, dense/bias/Read/ReadVariableOp, dense/kernel/Read/ReadVariableOp, dense_1/bias/Read/ReadVariableOp, dense_1/kernel/Read/ReadVariableOp, global_step/Read/ReadVariableOp)]]

Caused by op u'save/SaveV2', defined at:
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 163, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1406, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1106, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1143, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 778, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
    tensors)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()

UnimplementedError (see above for traceback): File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')

UPDATE

I get this error ...

INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented 

... even when --use_tpu=False

alexryan@alex-tpu:~/tpu$ cat train-mnist.sh 
python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --DATA_DIR=${STORAGE_BUCKET}/data \
  --MODEL_DIR=${STORAGE_BUCKET}/output \
  --use_tpu=False \
  --iterations=500 \
  --train_steps=2000

This Stack Overflow answer suggests that the TPU is trying to write to a non-existent local file system instead of the GCS bucket I specified. It is unclear to me why that would be happening.

Alex Ryan

2 Answers


In the first scenario, it seems the TPU you created is not in a healthy state, so deleting and recreating the TPU (or the entire VM) is the right way to resolve it.
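
(As an aside, before recreating you can check what the TPU itself reports; gcloud's describe command prints state and health fields. The zone below is the one from your ctpu up.)

# inspect the TPU's reported "state" and "health"
gcloud compute tpus describe alex-tpu --zone=us-central1-b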

I think the error in the second scenario (where you deleted the VM and re-created it) occurs because your ${STORAGE_BUCKET} is either undefined or not a proper GCS bucket. It must be a GCS bucket; a local path won't work and produces exactly the error you saw ("File system scheme '[local]' not implemented"). More information on creating a GCS bucket is in the section "Create a Cloud Storage bucket" at https://cloud.google.com/tpu/docs/tutorials/mnist
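
A quick sanity check along those lines (the bucket name here is just an example):

echo ${STORAGE_BUCKET}                     # should print a gs:// path, e.g. gs://my-tpu-bucket
gsutil ls ${STORAGE_BUCKET}                # should list the bucket without errors
export STORAGE_BUCKET=gs://my-tpu-bucket   # (re)set it if it is empty or not a gs:// path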

Hope this answers your question.

aman2930
  • My bad. I should have specified that I was able to "gsutil cp" the MNIST data from the VM to the GCS bucket. However, I'm not certain that the VM does the writing. The instructions to set up profiling with TensorBoard required giving the TPU permissions to the bucket. – Alex Ryan Oct 26 '18 at 06:10

I ran into the same problem and found that there is a typo in the tutorial. If you check mnist_tpu.py, you'll find that the flags need to be lowercase (--data_dir and --model_dir, not --DATA_DIR and --MODEL_DIR).
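
(A quick way to confirm the flag names yourself, assuming the script defines them with the usual DEFINE_* calls:)

grep -n "DEFINE" /usr/share/models/official/mnist/mnist_tpu.py
# should list data_dir and model_dir in lowercase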

If you change that, it works fine.

python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --data_dir=${STORAGE_BUCKET}/data \
  --model_dir=${STORAGE_BUCKET}/output \
  --use_tpu=True \
  --iterations=500 \
  --train_steps=2000