
I'm currently trying to submit a training job on Google Cloud ML using Facenet (a TensorFlow library for face recognition). I'm working with this part of the library (link is here), which does the training for the model.

For Google Cloud ML, I'm following this tutorial (link is here), which shows how to submit a training job.

I was able to successfully submit a training job to Google Cloud ML, but it failed with an error. Screenshots of the error and of the Google Cloud Jobs logs were attached here; the key message in the logs is:

    ImportError: Import by filename is not supported.

Submitting the job request succeeded, and it even got as far as waiting for TensorFlow to start, but right after that the error appears.

The command I used to run this is:

gcloud ml-engine jobs submit training facetraining_test4 \
--package-path=/Users/myname/Documents/projects/tf-projects/facenet/src/ \
--module-name=/Users/myname/Documents/projects/tf-projects/facenet/src/facenet_train_classifier.py \
--staging-bucket=gs://facenet-training-test \
--region=asia-east1 \
--config=/Users/myname/Documents/projects/tf-projects/facenet/none_config.yml  \
-- \
--logs_base_dir=/Users/myname/Documents/projects/tf-projects/logs/facenet/ \
--models_base_dir=/Users/myname/Documents/projects/tf-projects/models/facenet/ \
--data_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/employee_dataset/employee/employee_maxpy_mtcnnpy_182/ \
--image_size=160 \
--model_def=models.inception_resnet_v1 \
--lfw_dir=/Users/myname/Documents/projects/tf-projects/facenet_datasets/lfw/lfw_mtcnnpy_160/ \
--optimizer=RMSPROP \
--learning_rate -1  \
--max_nrof_epochs=80 \
--keep_probability=0.8 \
--learning_rate_schedule_file=/Users/myname/Documents/projects/tf-projects/facenet/data/learning_rate_schedule_classifier_casia.txt \
--weight_decay=5e-5  \
--center_loss_factor=1e-4

Any suggestions on how to fix this? Thanks!

Jeremy Lewi
Mikebarson
  • If you go to https://console.cloud.google.com/mlengine/jobs you'll see a list of jobs and a link to the logs. Look for errors there and report back what you find. – rhaertel80 Mar 05 '17 at 01:55
  • Are you sure you are not returning a non-zero status code at the end of your training script? If not, I'd simply add some logging statements to your .py file and then check the logs in console.cloud.google.com/mlengine/jobs for your job to see where it is crashing. – Amir Hormati Mar 05 '17 at 03:44
  • @rhaertel80 I've added more pictures for a more detailed look at the errors. – Mikebarson Mar 05 '17 at 06:50
  • @AmirHormati I'm quite new to the Google Cloud ML. I'm having a hard time trying to understand the errors. I've added pictures for a more detailed look at the errors. If you could help me understand the errors that would be great! – Mikebarson Mar 05 '17 at 06:51
  • @Mikebarson Do you have the refactored version that can run on Cloud ML? – wiput1999 May 30 '18 at 11:30

2 Answers


When you run on Cloud ML Engine, you are running in a remote environment, so the file paths will not be the same as in your local environment. If you need to import Python modules, you need to include them in the Python package you build and then import them using the package name.

For docs on how to build packages, please refer to the setuptools docs.

Here's the 30-second version:

  1. Organize your code as follows:

    my_package/__init__.py
    my_package/moduleA.py
    my_package/moduleB.py
    my_package/...
    setup.py
  2. For your setup.py file, start with this:

    from setuptools import find_packages
    from setuptools import setup

    REQUIRED_PACKAGES = []

    setup(
        name='my_package',
        version='0.1.1',
        author='Author',
        author_email='author@gmail.com',
        install_requires=REQUIRED_PACKAGES,
        packages=find_packages(),
        description='Description',
        requires=[],)
  3. Build a source distribution as follows:

    python ./setup.py sdist
  4. When the package is installed on Cloud ML Engine, you will be able to import your code as:

    from my_package import moduleA
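Applied to the question's job, the submission might then look like the following sketch (the bucket layout and the dotted module name are assumptions, and most of the Facenet flags are omitted for brevity). The key differences: `--module-name` takes a dotted Python module name, not a file path, and the user args point at GCS rather than the local filesystem:

```shell
# Sketch only: bucket paths and module name are assumptions.
# --module-name must be a dotted module name, not a .py file path.
gcloud ml-engine jobs submit training facetraining_test5 \
  --package-path=/Users/myname/Documents/projects/tf-projects/facenet/src/ \
  --module-name=src.facenet_train_classifier \
  --staging-bucket=gs://facenet-training-test \
  --region=asia-east1 \
  -- \
  --logs_base_dir=gs://facenet-training-test/logs/ \
  --models_base_dir=gs://facenet-training-test/models/ \
  --data_dir=gs://facenet-training-test/datasets/employee_maxpy_mtcnnpy_182/
```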
Jeremy Lewi
  • I see! So this applies to all the modules that it tries to import under facenet_train_classifier.py. What about those files like the '--log_base_dirs=...' those are the [USER_ARGS] section. Will I need to package those too? – Mikebarson Mar 09 '17 at 15:36
  • @Mikebarson: yes, they need to be made accessible remotely. From the names of those particular args, they look like output directories, so should be paths on GCS. There is an under-documented gcloud command-line parameter, --job-dir. That directory will be passed to your programs as a command-line argument '--job-dir'. From there you can construct paths such as 'logs' and 'models'. – rhaertel80 Mar 11 '17 at 17:32
  • @rhaertel80 So you mean that I should put those '--log_base_dirs' and '--models_base_dirs' in a file. Then pass that file to '--job_dir'? How should I format those commands inside the file and what extension should I call the file? – Mikebarson Mar 13 '17 at 06:51
  • @Mikebarson. There are two issues. (1) the output directories need to be locations on GCS (2) independent of that is how to pass those into the CloudML Engine. You can continue to use your own command-line args ('--log_base_dirs', '--models_base_dirs'), just make sure they point to a valid GCS bucket, something like, "gs://my-ml-bucket/job_x/log". Consider using the same bucket, but different 'sub-directory', as your input data. – rhaertel80 Mar 13 '17 at 14:12
  • @rhaertel80 I see. I did that actually. But Google Cloud couldn't locate it. I'll try again. So what's the use of '--job-dir'? – Mikebarson Mar 14 '17 at 06:38
  • @Mikebarson. I need more info to be able to figure out why Google cloudn't locate it. For instance, you can't use os.path.exists on GCS paths (see http://stackoverflow.com/a/42799952/1399222 for a solution). Regarding `--job-dir`, if you choose to use it, the service will ensure you have write permissions before running your job and then pass it through to your binary as `--job-dir` (so be sure to parse it). Then you can use it to build sub-directories for logs and models, e.g., `model_dir = os.path.join(args.job_dir, 'model')` – rhaertel80 Mar 15 '17 at 02:20
  • @rhaertel80 Alright. I don't think I've used os.path.exists. It also left me wondering why it couldn't locate the directories when they're already in the bucket. About the `--job-dir` I'm still a bit confused. When should I use `--job-dir` and why? – Mikebarson Mar 15 '17 at 02:24
  • @Mikebarson. In general, you can't use any regular file operations from the standard library (glob, list exists, mkdir), so check that, too. What's the exact error? `--job-dir` is a convenience function for allowing the service to verify the permissions. It should also create subdirectories for you during hyperparameter tuning so that you don't have to manage that yourself and risk overwriting previous output. – rhaertel80 Mar 15 '17 at 02:28
  • @rhaertel80 So if my file contains any of those operations, it would likely result to an error? So `--job-dir` is used for verification of permissions and creation of sub-dirs during hyperparamater tuning. I'm using the hyperparameter tuning so that probably won't be something I'll need using for now. Can you perhaps check this [https://github.com/davidsandberg/facenet/blob/master/src/facenet_train_classifier.py]. That's what I'm using to run the job which is shown above. Taking note what Jermey Lewi said, I need to package the files that are being imported in this file for it to work. – Mikebarson Mar 15 '17 at 02:42
  • Your problem likely stems from the calls to `os.path.isdir()` and `os.makedirs()`. You should be able to safely replace those with the equivalent functionality in `file_io.py` (see SO link in previous comment for more info). Those changes will still work with local runs. – rhaertel80 Mar 15 '17 at 03:10

Looking at the error message above, it seems that your issue is the "ImportError: Import by filename is not supported" error from Python. Without seeing your Python source code I can't tell you exactly how to fix it, but the following link should solve your issue:

Python / ImportError: Import by filename is not supported

In general, look for places where you import using file paths and make sure you are using the import machinery correctly.
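To illustrate (a minimal sketch; the file and module names are hypothetical): passing a file path where Python expects a module *name* raises exactly this kind of ImportError, while `importlib` can load a file by path explicitly if you really need to:

```python
import importlib.util
import os
import tempfile

# Passing a file path where a module name is expected fails.
# (Python 2 reports "Import by filename is not supported"; Python 3
# raises ModuleNotFoundError, which is a subclass of ImportError.)
try:
    __import__('/some/path/my_module.py')
except ImportError as e:
    print('import by filename failed:', type(e).__name__)

# Loading a module from an explicit file path instead:
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, 'my_module.py')
    with open(path, 'w') as f:
        f.write('ANSWER = 42\n')
    spec = importlib.util.spec_from_file_location('my_module', path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    print(mod.ANSWER)  # -> 42
```

On Cloud ML Engine, though, the right fix is the packaging approach described in the other answer, not path-based loading.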

Amir Hormati
  • Could it be that Google can't find the files? This was working properly in my local terminal. – Mikebarson Mar 06 '17 at 02:02
  • When you are running on Cloud ML Engine you are running in a remote environment, so the file paths will not be the same as in the local environment. If you need to import Python modules you write, you need to include them in the Python package you build and then import them using the package name. – Jeremy Lewi Mar 06 '17 at 14:14
  • @JeremyLewi can you show me an example? That would really help! Thanks! :D – Mikebarson Mar 08 '17 at 14:25
  • I added an answer with more details. – Jeremy Lewi Mar 09 '17 at 15:32
  • Thanks! @JeremyLewi I'll be testing your solutions soon. – Mikebarson Mar 13 '17 at 06:52