
I am deploying a model to Google Cloud ML for the first time. I have trained and tested the model locally; it still needs work, but it works OK.

I have uploaded it to Cloud ML and tested it with the same example images I test locally, which I know produce detections (using this tutorial).

When I do this, I get no detections. At first I thought I had uploaded the wrong checkpoint, but I verified that the same checkpoint works with these images offline. I don't know how to debug further.

When I look at the results, the file

prediction.results-00000-of-00001

is just empty, and the file

prediction.errors_stats-00000-of-00001

contains the following text:

('No JSON object could be decoded', 1)

Is this a sign the detection has run and detected nothing, or is there some problem while running?

Maybe the problem is that I am preparing the images incorrectly for upload?

The logs show no errors at all.

Thank you

EDIT:

I was doing more tests and tried to run the model locally using the command "gcloud ml-engine local predict" instead of my usual local code. I get the same result as online: no answer at all, but also no error message.
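(For reference, local and online prediction expect newline-delimited JSON rather than a TFRecord. The sketch below shows one way to build such an input file, assuming the model was exported with input_type encoded_image_string_tensor; the {"b64": ...} envelope is how the prediction service marks base64-encoded binary data, though the exact instance shape depends on the exported signature.)

```python
import base64
import json

def jpeg_to_instance(image_bytes):
    """Wrap raw image bytes in the {"b64": ...} envelope that the
    prediction service decodes back into a binary string tensor."""
    return {"b64": base64.b64encode(image_bytes).decode("utf-8")}

def write_json_instances(image_paths, output_path):
    """Write one JSON instance per line, the format that
    'gcloud ml-engine local predict --json-instances' expects."""
    with open(output_path, "w") as out:
        for path in image_paths:
            with open(path, "rb") as img:
                out.write(json.dumps(jpeg_to_instance(img.read())) + "\n")
```

The resulting file can then be passed as --json-instances=inputs.json to gcloud ml-engine local predict to rule out input-formatting problems before moving to batch prediction.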

EDIT 2: I am using a TF_Record file, so I don't understand the JSON response. Here is a copy of my command:

gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=tf_record \
    --input-paths=gs://MY_BUCKET/data_dir/inputs.tfr \
    --output-path=gs://MY_BUCKET/data_dir/version4 \
    --region us-central1 \
    --model="gcp_detector" \
    --version="Version4"

  • It is difficult to troubleshoot your issue without an [MCV sample](https://stackoverflow.com/help/mcve) or a more detailed description of what you are doing and how you are doing it. However, given the error message shown, I would say that the issue is indeed in the input data (which should be a JSON file if you are using local/online prediction, and a `TFRecord` if you are using batch prediction). The tutorial you shared provides a snippet that transforms JPG images into a valid JSON file. Are you using it? If not, how are you transforming your images to JSON for the input? – dsesto Apr 30 '18 at 14:17
  • @dsesto That makes sense, however, I am using a TF_Record file, so I don't understand the JSON response. Here is a copy of my command: gcloud ml-engine jobs submit prediction ${JOB_ID} --data- format=tf_record \ --input-paths=gs://my_inference/data_dir/inputs.tfr \ --output-path=gs://my_inference/data_dir/version4 \ --region us-central1 \ --model="gcp_detector" \ --version="Version4" – Jose Alberto Soler May 01 '18 at 20:52
  • If you need to add code snippets and/or command, please do it better by editing the question, as they are easier to understand with the correct formatting. Also, no need to share bucket/project names, so you can obfuscate that part with something like `MY_BUCKET`. – dsesto May 02 '18 at 07:39
  • Getting back to the topic, in your question you said that you tried running the model **local** and **online**, and as I explained in my previous comment, those prediction modes work with JSON input data, and not `TFRecord`. However, the command you shared in your comment corresponds to a batch prediction, therefore I think there is some confusion with this topic. I would recommend that you start by testing your model locally (with a few samples, 2-3 instances), **using JSON**. Once you know it works, you can run predictions using Cloud ML Engine in batch/online mode, depending on your needs. – dsesto May 02 '18 at 07:40
  • @dsesto I understand the difference between batch and single prediction. I use a json file when I do single offline prediction. I found this user is having the same problem as me with the "No JSON object could be decoded" message. https://stackoverflow.com/questions/45984227/gcloud-jobs-submit-prediction-cant-decode-json-with-data-format-tf-record/50170548#50170548 – Jose Alberto Soler May 04 '18 at 08:35
  • Hi @JoseAlbertoSoler, I saw that you created a Support case with the GCP team in order to report this behavior, and the Cloud ML Engine specialist team is looking into this issue. Could you please update this thread with an answer summarizing the issue / resolution once you have a solution for it? That way, other users facing a similar issue will be able to see the solution that worked for you. Thanks in advance! – dsesto May 09 '18 at 14:58
  • Hi @dsesto, yes, of course, I will update it here. – Jose Alberto Soler May 10 '18 at 15:38
  • @dsesto added the fix – Jose Alberto Soler Jul 14 '18 at 04:52
  • Thanks for the heads-up! Feel free to post it as an answer and mark it as accepted so that the community sees that this issue was solved. – dsesto Jul 16 '18 at 07:06

1 Answer


It works with the following commands.

Model export:

# From tensorflow/models
export PYTHONPATH=$PYTHONPATH:/home/[user]/repos/DeepLearning/tools/models/research:/home/[user]/repos/DeepLearning/tools/models/research/slim
cd /home/[user]/repos/DeepLearning/tools/models/research
python object_detection/export_inference_graph.py \
    --input_type encoded_image_string_tensor \
    --pipeline_config_path /home/[user]/[path]/ssd_mobilenet_v1_pets.config \
    --trained_checkpoint_prefix /[path_to_checkpoint]/model.ckpt-216593 \
    --output_directory /[output_path]/output_inference_graph.pb

Cloud execution:

gcloud ml-engine jobs submit prediction ${JOB_ID} --data-format=TF_RECORD \
    --input-paths=gs://my_inference/data_dir/inputs/* \
    --output-path=${YOUR_OUTPUT_DIR}  \
    --region us-central1 \
    --model="model_name" \
    --version="version_name" 

I don't know exactly which change fixes the issue, but there are some small differences, such as tf_record now being TF_RECORD. Hope this helps someone else. Props to Google support for their help (they suggested the changes).