Following the Running MNIST on Cloud TPU tutorial, I get the following error when I try to train:
python /usr/share/models/official/mnist/mnist_tpu.py \
--tpu=$TPU_NAME \
--DATA_DIR=${STORAGE_BUCKET}/data \
--MODEL_DIR=${STORAGE_BUCKET}/output \
--use_tpu=True \
--iterations=500 \
--train_steps=2000
=>
alexryan@alex-tpu:~/tpu$ ./train-mnist.sh
W1025 20:21:39.351166 139745816463104 __init__.py:44] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/__init__.py", line 41, in autodetect
from . import file_cache
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/file_cache.py", line 41, in <module>
'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "/usr/share/models/official/mnist/mnist_tpu.py", line 152, in main
tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations, FLAGS.num_shards),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_config.py", line 207, in __init__
self._master = cluster.master()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 223, in master
job_tasks = self.cluster_spec().job_tasks(self._job_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/tpu_cluster_resolver.py", line 269, in cluster_spec
(compat.as_text(self._tpu), response['health']))
RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"
alexryan@alex-tpu:~/tpu$
The only places where I varied from the instructions were:
Instead of running ctpu in Cloud Shell, I ran it on my Mac.
>ctpu version
ctpu version: 1.7
The zone where the TPU resided was different from the default zone in my config, so I specified it as an option, like so:
>cat ctpu-up.sh
ctpu up --zone us-central1-b --preemptible
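For what it's worth, I believe the TPU's state can be checked from the command line as well (this is not part of the tutorial, and the exact flags are my assumption about the ctpu/gcloud CLIs):

>ctpu status --zone us-central1-b
>gcloud compute tpus list --zone us-central1-b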
I was able to move the MNIST files to the GCS bucket from the VM with no problem:
alexryan@alex-tpu:~$ gsutil cp -r ./data ${STORAGE_BUCKET}
Copying file://./data/validation.tfrecords [Content-Type=application/octet-stream]...
Copying file://./data/train-images-idx3-ubyte.gz [Content-Type=application/octet-stream]...
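To confirm the files actually landed in the bucket, a quick listing shows them (assuming ${STORAGE_BUCKET} is still set in the shell):

>gsutil ls -r ${STORAGE_BUCKET}/data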
I tried the (Optional) Set up TensorBoard > Running cloud_tpu_profiler step:
Go to the Cloud Console > TPUs > and click on the TPU you created. Locate the service account name for the Cloud TPU and copy it, for example:
service-11111111118@cloud-tpu.iam.gserviceaccount.com
In the list of buckets, select the bucket you want to use, select Show Info Panel, and then select Edit bucket permissions. Paste your service account name into the add members field for that bucket and select the following permissions:
"Cloud Console > TPUs" does not exist as an option
so I used the service account associate with the VM
"Cloud Console > Compute Engine > alex-tpu"
Since the last error message was "RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"", I used ctpu to delete the VM, re-create it, and run again. This time I got more errors:
This one seems like it might be just a warning ...
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Not sure about this one ...
ERROR:tensorflow:Operation of type Placeholder (reshape_input) is not supported on the TPU. Execution will fail if this op is used in the graph.
This one seemed to kill the training ...
INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562') [[{{node save/SaveV2}} = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64],
_device="/job:worker/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv2d/bias/Read/ReadVariableOp, conv2d/kernel/Read/ReadVariableOp, conv2d_1/bias/Read/ReadVariableOp, conv2d_1/kernel/Read/ReadVariableOp, dense/bias/Read/ReadVariableOp, dense/kernel/Read/ReadVariableOp, dense_1/bias/Read/ReadVariableOp, dense_1/kernel/Read/ReadVariableOp, global_step/Read/ReadVariableOp)]]
Caused by op u'save/SaveV2', defined at:
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 173, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/usr/share/models/official/mnist/mnist_tpu.py", line 163, in main
    estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2394, in train
    saving_listeners=saving_listeners
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 356, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1181, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1215, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/estimator.py", line 1406, in _train_with_estimator_spec
    log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 504, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 921, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 643, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1107, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1112, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 800, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 557, in create_session
    self._scaffold.finalize()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 215, in finalize
    self._saver.build()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1106, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1143, in _build
    build_save=build_save, build_restore=build_restore)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 778, in _build_internal
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 369, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 284, in _AddSaveOps
    save = self.save_op(filename_tensor, saveables)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 202, in save_op
    tensors)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1690, in save_v2
    shape_and_slices=shape_and_slices, tensors=tensors, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1768, in __init__
    self._traceback = tf_stack.extract_stack()
UnimplementedError (see above for traceback): File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')
UPDATE
I get this error ...
INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented
... even when --use_tpu=False
alexryan@alex-tpu:~/tpu$ cat train-mnist.sh
python /usr/share/models/official/mnist/mnist_tpu.py \
--tpu=$TPU_NAME \
--DATA_DIR=${STORAGE_BUCKET}/data \
--MODEL_DIR=${STORAGE_BUCKET}/output \
--use_tpu=False \
--iterations=500 \
--train_steps=2000
This Stack Overflow answer suggests that the TPU is trying to write to a local (non-existent on the TPU worker) file system instead of the GCS bucket I specified. It is unclear to me why that would be happening.
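One thing I notice is that the checkpoint path in the error is under /tmp/..., which looks like the Estimator's default temporary model directory rather than my bucket, and as far as I know TPU workers can only read and write gs:// paths. If the script's flags are actually defined in lowercase (an assumption on my part; I have not checked mnist_tpu.py), then the --DATA_DIR/--MODEL_DIR spellings above would not be picked up, and the invocation would need to look like this instead:

python /usr/share/models/official/mnist/mnist_tpu.py \
  --tpu=$TPU_NAME \
  --data_dir=${STORAGE_BUCKET}/data \
  --model_dir=${STORAGE_BUCKET}/output \
  --use_tpu=True \
  --iterations=500 \
  --train_steps=2000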