
Does anybody know what changes need to be made to the trainer in order to run a job on the distributed platform on Google Cloud ML?

It would be of great help if somebody could share a few articles or docs about this.

Appu

3 Answers


By and large, your distributed TensorFlow program will be exactly that -- distributed TensorFlow, with minimal -- or even no -- cloud-specific changes. The best resource for distributed TensorFlow is this tutorial on tensorflow.org. The tutorial walks you through the low-level way of doing things.

There is also a higher-level API, currently in contrib (so the API may change, and it will move out of contrib in a future version), that reduces the amount of boilerplate code you have to write for distributed training. The official tutorial is here.

Once you've understood the general TensorFlow bits (whether high-level or low-level APIs), there are some specific elements that must be present in your code to get it to run on Cloud ML Engine. In the case of the low-level TensorFlow APIs, you'll need to parse the TF_CONFIG environment variable to set up your ClusterSpec. This is demonstrated in this example (see specifically this block of code).
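To make the TF_CONFIG step concrete, here is a minimal sketch of the parsing the linked sample performs. It assumes the documented shape of the variable (a JSON object with a "cluster" map and a "task" entry); the TensorFlow calls at the end are shown in comments because they require a running TF install.

```python
import json
import os

def parse_tf_config(env=os.environ):
    """Parse the TF_CONFIG variable Cloud ML Engine sets on each machine.

    Returns (cluster, job_name, task_index). `cluster` maps job names
    ("ps", "worker", "master") to lists of host:port addresses.
    """
    tf_config = json.loads(env.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    return cluster, task.get("type", ""), int(task.get("index", 0))

# With TensorFlow available, you would then build the cluster and server:
#   cluster_spec = tf.train.ClusterSpec(cluster)
#   server = tf.train.Server(cluster_spec,
#                            job_name=job_name, task_index=task_index)
```

Each machine in the job runs the same code; the "task" entry tells that process which role it plays, which is how you branch into parameter-server vs. worker behavior.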

One advantage of the higher-level APIs is that all of that parsing is already taken care of for you; your code should generally just work. See this example. The important piece is that you will need to use learn_runner.run() (see this line), which will work both locally and in the cloud to train your model.

Of course, there are other frameworks as well, e.g., TensorFX.

After you've structured your code appropriately, you simply select a scale tier with multiple machines when launching your training job (see Chuck Finley's answer for an example).

Hope it helps!

rhaertel80
  • Thanks a lot. It gives me a starting point, as I was confused about how to modify my code. – Appu Apr 04 '17 at 19:39
  • I have been able to run code in a distributed environment, but as of now every machine in the cloud is receiving the full data and generating output after averaging. I want to know how I can give distributed data to each machine in the cloud. – Appu Apr 10 '17 at 06:47
  • While submitting a job on Google Cloud ML, I am getting an error where the main training Python file, i.e. task.py, is not able to import a function from a Python script in the util folder. Generally, we write: from util.xyz import abc; this is not getting called in the main task.py. – Appu Apr 12 '17 at 09:39
  • I have been able to run this using gcloud local training, but this problem occurs when I try to use gcloud job submit. – Appu Apr 13 '17 at 05:44
  • How do I install a Python package from GitHub using setup.py? I am trying to install the kenlm module from GitHub by passing it in setup.py. Please help; I need an answer urgently. – Appu Apr 13 '17 at 10:27
  • Local training works because it uses your system's setup. On the cloud you'll need to force certain libraries to be installed. RE: GitHub, see http://stackoverflow.com/a/3481388/1399222. dependency_links should do the trick. – rhaertel80 Apr 14 '17 at 04:56

If you have your model constructed with TensorFlow Estimators, the changes you need to make are very minimal. You can basically plug your code into, e.g., this boilerplate code.

dumkar

Is your question answered by the --scale-tier argument in Run Distributed Training in the Cloud?

gcloud ml-engine jobs submit training $JOB_NAME \
   --job-dir $OUTPUT_PATH \
   --runtime-version 1.0 \
   --module-name trainer.task \
   --package-path trainer/ \
   --region $REGION \
   --scale-tier STANDARD_1 \
   -- \
   --train-files $TRAIN_DATA \
   --eval-files $EVAL_DATA \
   --train-steps 1000 \
   --verbose-logging true
Chuck Finley
  • Here's a direct link to the info about distributed training: https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#cloud-train-dist – rhaertel80 Apr 04 '17 at 14:42
  • I think this job submission will come at a later stage, once I have modified my code as per distributed TensorFlow. – Appu Apr 04 '17 at 19:39
  • @Appu did you get it to work? Sounds like it's worth its own question, but it's going to boil down to your setup.py. See stackoverflow.com/a/40287409/1399222 – rhaertel80 Apr 14 '17 at 04:57