3

I'm using open-source TensorFlow implementations of research papers, for example DCGAN-tensorflow. Most of the libraries I'm using are configured to train the model locally, but I want to use Google Cloud ML to train the model since I don't have a GPU on my laptop. I'm finding it difficult to change the code to support GCS buckets. At the moment, I'm saving my logs and models to /tmp and then running a 'gsutil' command to copy the directory to gs://my-bucket at the end of training (example here). If I try saving the model directly to gs://my-bucket, it never shows up.

As for training data, one of the TensorFlow samples copies data from GCS to /tmp for training (example here), but this only works when the dataset is small. I want to use celebA, and it is too large to copy to /tmp on every run. Is there any documentation or guide on how to update code that trains locally so that it runs on Google Cloud ML?

The implementations are running various versions of TensorFlow, mainly 0.11 and 0.12.

psoulos

1 Answer

11

There is currently no definitive guide. The basic idea would be to replace all occurrences of native Python file operations with equivalents in the file_io module (from tensorflow.python.lib.io import file_io), most notably:

  • open() -> file_io.FileIO
  • os.path.exists() -> file_io.file_exists
  • glob.glob() -> file_io.get_matching_files

These functions will work locally and on GCS (as well as any registered file system). Note, however, that there are some slight differences between file_io and the standard file operations (e.g., a different set of 'modes' is supported).
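For illustration, a minimal sketch of what those replacements look like in practice (the gs://my-bucket paths and file names below are placeholders, not anything from the original code):

```python
# Minimal sketch; the gs://my-bucket paths are placeholders for your own bucket.
from tensorflow.python.lib.io import file_io

# open(path) -> file_io.FileIO(path, mode=...); accepts local paths and gs:// URIs.
with file_io.FileIO('gs://my-bucket/config.json', mode='r') as f:
    config_text = f.read()

# os.path.exists(path) -> file_io.file_exists(path)
if not file_io.file_exists('gs://my-bucket/output'):
    file_io.create_dir('gs://my-bucket/output')

# glob.glob(pattern) -> file_io.get_matching_files(pattern)
image_paths = file_io.get_matching_files('gs://my-bucket/celebA/*.jpg')

# open(path, 'w').write(...) -> file_io.write_string_to_file(path, contents)
file_io.write_string_to_file('gs://my-bucket/output/done.txt', 'finished\n')
```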

Fortunately, checkpoint and summary writing work out of the box; just be sure to pass a GCS path to tf.train.Saver.save and tf.summary.FileWriter.
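For example, a rough sketch using the TF 0.12-style API the question mentions (bucket paths are hypothetical; per the comment thread below, some versions may require the checkpoint "directory" to exist before saving):

```python
import tensorflow as tf
from tensorflow.python.lib.io import file_io

LOG_DIR = 'gs://my-bucket/logs'          # placeholder bucket paths
CKPT_DIR = 'gs://my-bucket/checkpoints'

loss = tf.Variable(0.0, name='loss')
loss_summary = tf.summary.scalar('loss', loss)

saver = tf.train.Saver()
writer = tf.summary.FileWriter(LOG_DIR)  # summaries stream straight to GCS

# Some TF versions require the parent "directory" to exist before saving
# (see the comments below), so create it explicitly.
if not file_io.file_exists(CKPT_DIR):
    file_io.create_dir(CKPT_DIR)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    summary = sess.run(loss_summary)
    writer.add_summary(summary, global_step=0)
    saver.save(sess, CKPT_DIR + '/model.ckpt', global_step=0)  # checkpoint lands on GCS
    writer.close()
```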

In the sample you sent, that looks potentially painful. Consider monkey-patching the Python functions to map to the TensorFlow equivalents when the program starts, so you only have to do it once (demonstrated here).
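A minimal sketch of that monkey-patching approach (assuming Python 2, which Cloud ML used at the time; on Python 3 patch the builtins module instead of __builtin__):

```python
import glob
import os.path
import __builtin__  # on Python 3: import builtins
from tensorflow.python.lib.io import file_io

def patch_file_io():
    """Redirect common file operations to file_io so gs:// paths work everywhere."""
    # Wrap FileIO in a lambda because, unlike open(), it has no default mode.
    __builtin__.open = lambda name, mode='r': file_io.FileIO(name, mode=mode)
    os.path.exists = file_io.file_exists
    glob.glob = file_io.get_matching_files

# Call once at program start, before the third-party training code runs.
patch_file_io()
```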

As a side note, all of the samples on this page show reading files from GCS.

rhaertel80
  • Thanks for this info. I'm running into an issue when trying to read in an image file. Instead of using scipy.misc.imread("gs://BUCKET/PATH"), which doesn't work with a GCS URI, I am first opening the file using file_io: scipy.misc.imread(file_io.FileIO(path, mode='r')). This seems to be returning an object instead of an array. I'm not sure how to fix this problem on Cloud ML since the discussions online seem to indicate that it is an issue with the PIL install. – psoulos Mar 15 '17 at 20:03
  • This is an ugly hack, but you can use subprocess in your setup.py to do things like `apt-get` and `pip install`. That said, let me investigate. It might make sense to post a separate question about this issue. – rhaertel80 Mar 16 '17 at 04:21
  • Noted this in my answer to your other question here (http://stackoverflow.com/questions/42821093/google-cloud-ml-scipy-misc-imread-returning-pil-jpegimageplugin-jpegimagefile) but the issue appears to be a bug in the version of file_io in TF 0.12.1, and is fixed in TF 1.0 – Chris Meyers Mar 16 '17 at 17:10
  • tf.summary.FileWriter seems to work when provided with a GCS path, but tf.train.Saver.save only works when it is given a GCS path where the parent directories already exist. I believe this is a bug since GCS is supposed to be flat storage where file hierarchy is provided as a convenience. – psoulos Mar 17 '17 at 00:52
  • As a follow-up, the crash occurs here: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/training/saver.py#L1380 – psoulos Mar 17 '17 at 00:59
  • That might be by design, but I'll double check. Especially if the behavior for summaries and checkpoints differ. – rhaertel80 Mar 17 '17 at 13:26
  • The gcs_file_system.cc in the TensorFlow code implements a FileSystem interface, so the higher-level code (including Python code like saver.py) is expected to treat GCS as a traditional file system and not distinguish between GCS and the local file system. With this principle in mind, IsDirectory in GCS returns false if no objects (including zero-length directory markers) exist with the corresponding prefix. – Alexey Surkov Apr 20 '17 at 18:10
  • Regarding "As a side note, all of the samples on this page show reading files from GCS": is this true? It seems all of these samples copy the files from gs: to a local file on the Datalab VM and then read the file from the local file system. I found this post because I am trying to read binary files from gs: into my GCP-hosted Datalab notebook. – netskink Jul 09 '18 at 15:12
  • The Keras sample is copying an .h5 file from local disk to GCS because .h5 is not supported on GCS. But I believe the other samples are reading directly from GCS. Can you point me to the problematic samples? Or can I help you read files from GCS? – rhaertel80 Jul 09 '18 at 21:16