
I have a small example of the TensorFlow Object Detection API working locally, and everything looks great. My goal is to run their scripts on Google Cloud Machine Learning Engine, which I've used extensively in the past. I am following these docs.

Declare some relevant variables

declare PROJECT=$(gcloud config list project --format "value(core.project)")
declare BUCKET="gs://${PROJECT}-ml"
declare MODEL_NAME="DeepMeerkatDetection"
declare FOLDER="${BUCKET}/${MODEL_NAME}"
declare JOB_ID="${MODEL_NAME}_$(date +%Y%m%d_%H%M%S)"
declare TRAIN_DIR="${FOLDER}/${JOB_ID}"
declare EVAL_DIR="${BUCKET}/${MODEL_NAME}/${JOB_ID}_eval"
declare PIPELINE_CONFIG_PATH="${FOLDER}/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config"
declare PIPELINE_YAML="/Users/Ben/Documents/DeepMeerkat/training/Detection/cloud.yml"

My YAML file looks like:

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

The relevant paths are set in the config, e.g.

  fine_tune_checkpoint: "gs://api-project-773889352370-ml/DeepMeerkatDetection/checkpoint/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"

I've packaged object detection and slim using setup.py
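
A sketch of that packaging step, assuming the tarballs are built from the tensorflow/models/research directory as in the object detection docs:

# From tensorflow/models/research/: builds dist/object_detection-0.1.tar.gz and
# slim/dist/slim-0.1.tar.gz, the two tarballs passed to --packages below.
python setup.py sdist
(cd slim && python setup.py sdist)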

Running

gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path= ${PIPELINE_CONFIG_PATH}

yields a TensorFlow (import?) error. It's a bit cryptic:

insertId:  "1inuq6gg27fxnkc"  
 logName:  "projects/api-project-773889352370/logs/ml.googleapis.com%2FDeepMeerkatDetection_20171017_141321_train"  
 receiveTimestamp:  "2017-10-17T21:38:34.435293164Z"  
 resource: {…}  
 severity:  "ERROR"  
 textPayload:  "The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. 
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 145, in main
    model_config, train_config, input_config = get_configs_from_multiple_files()
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 127, in get_configs_from_multiple_files
    text_format.Merge(f.read(), train_config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 112, in read
    return pywrap_tensorflow.ReadFromStream(self._read_buf, length, status)
  File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
FailedPreconditionError: .

I've seen this error in other questions related to prediction on Machine Learning Engine, which suggests the error is probably(?) not directly related to the object detection code itself; it feels like the code is not being packaged correctly, or is missing dependencies. I've updated my gcloud to the latest version.

Bens-MacBook-Pro:research ben$ gcloud --version
Google Cloud SDK 175.0.0
bq 2.0.27
core 2017.10.09
gcloud 
gsutil 4.27

It's hard to see how it's related to this problem here:

FailedPreconditionError when running TF Object Detection API with own model

Why would the code need to be initialized differently in the cloud?

Update #1.

The curious thing is that eval.py works fine, so it can't be the path to the config file, or anything else that train.py and eval.py share. eval.py patiently sits and waits for model checkpoints to be created.
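
For reference, the eval job is submitted the same way; a sketch only, with the module name and per-module flags taken from object_detection/eval.py (the scale tier and exact command here are assumptions):

gcloud ml-engine jobs submit training "${JOB_ID}_eval" \
    --job-dir=${EVAL_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.eval \
    --region us-central1 \
    --scale-tier BASIC_GPU \
    -- \
    --checkpoint_dir=${TRAIN_DIR} \
    --eval_dir=${EVAL_DIR} \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH}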


Another idea might be that the checkpoint has somehow been corrupted during upload. We can test this by bypassing it and training from scratch.

In the .config file:

  from_detection_checkpoint: false

That yields the same precondition error, so it can't be the model.

bw4sz
  • It appears to be failing when trying to open the train_config file. It's hard to decode, but the error message has a "." which makes me think it's trying to read the local directory as the config file. How does your code set that filename? – rhaertel80 Oct 20 '17 at 06:40
  • The overall config file is set from the command line: --pipeline_config_path=${PIPELINE_CONFIG_PATH}, which is gs://api-project-773889352370-ml/DeepMeerkatDetection/faster_rcnn_inception_resnet_v2_atrous_coco_cloud.config. I also thought it would be a path error, but the eval.py script takes the same argument and has no problem running. Okay, but the point is, you don't see this as a Cloud ML error, but as something internal to debug. – bw4sz Oct 20 '17 at 17:46
  • I looked at the logic in the code: `if FLAGS.pipeline_config_path: model_config, train_config, input_config = get_configs_from_pipeline_file() else: model_config, train_config, input_config = get_configs_from_multiple_files()`. The stack trace you sent contains get_configs_from_multiple_files. But based on the information in your comment, you are trying to set `--pipeline_config_path`, so I presume you *expect* `get_configs_from_pipeline_file()` to be run instead. Clearly there is an issue with the flags. Answered below. – rhaertel80 Oct 20 '17 at 19:04

1 Answer


The root cause is a slight typo:

--pipeline_config_path= ${PIPELINE_CONFIG_PATH}

has an extra space. Try this:

gcloud ml-engine jobs submit training "${JOB_ID}_train" \
    --job-dir=${TRAIN_DIR} \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config ${PIPELINE_YAML} \
    -- \
    --train_dir=${TRAIN_DIR} \
    --pipeline_config_path=${PIPELINE_CONFIG_PATH}
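
For context, this is why the failure showed up where it did; paraphrasing the logic from object_detection/train.py quoted in the comments above:

# With the stray space, the shell splits the token into two arguments, so
# FLAGS.pipeline_config_path ends up empty. The else branch then runs and
# tries to read the (also empty) per-file config paths, which surfaces as
# the cryptic FailedPreconditionError on the ps replica.
if FLAGS.pipeline_config_path:
    model_config, train_config, input_config = get_configs_from_pipeline_file()
else:
    model_config, train_config, input_config = get_configs_from_multiple_files()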
rhaertel80
  • Sigh. For those wondering, the "precondition" took the space as the current working directory, so only " " was parsed, and train.py was therefore looking for a config named " ". Properly sad. – bw4sz Oct 20 '17 at 21:04