Cloud ML Engine allows one to deploy nearly any TensorFlow model you are able to export. For any such model, you define your inputs and your outputs and that is what dictates the form of the requests and responses.
I think it might be useful to understand this by walking through an example. You could imagine exporting a model like so:
import tensorflow as tf

def my_model(image_in):
    # Construct an inference graph for predicting captions.
    #
    # image_in is a tensor with shape=(None,) and dtype=tf.string,
    # meaning a batch of raw image bytes.

    # ... do your interesting stuff here ...

    # caption_out is a tensor with shape=(None, MAX_WORDS) and
    # dtype=tf.string; that is, you will be returning a batch of
    # captions, one per input image, one word per column, with
    # padding when the number of words to output is < MAX_WORDS.
    return caption_out

image_in = tf.placeholder(shape=(None,), dtype=tf.string)
caption_out = my_model(image_in)

inputs = {"image_bytes": tf.saved_model.utils.build_tensor_info(image_in)}
outputs = {"caption": tf.saved_model.utils.build_tensor_info(caption_out)}

signature = tf.saved_model.signature_def_utils.build_signature_def(
    inputs=inputs,
    outputs=outputs,
    method_name='tensorflow/serving/predict'
)
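For completeness, writing the SavedModel to disk with that signature might look roughly like the sketch below (this assumes the TF 1.x SavedModelBuilder API; the export path is a placeholder, and in practice you would restore your trained variables rather than initializing them):

export_dir = "gs://my-bucket/my-model/1"  # placeholder export path

with tf.Session() as sess:
    # In a real export you would restore trained weights here instead.
    sess.run(tf.global_variables_initializer())

    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                signature
        }
    )
    builder.save()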
After you export this model (cf. this post), you would construct a JSON request like so:
{
  "instances": [
    {
      "image_bytes": {
        "b64": <base64_encoded_image1>
      }
    },
    {
      "image_bytes": {
        "b64": <base64_encoded_image2>
      }
    }
  ]
}
Let us analyze the request. First, we will be sending the service a batch of images. All requests are JSON objects with an array-valued attribute called "instances"; each entry in the array is one instance to feed the graph for prediction. Note that this is why we are required to set the outermost dimension to None on exported models -- they need to be able to handle variable-sized batches.
Each entry in the array is itself a JSON object whose attributes are the keys of the inputs dict we defined when exporting the model. In this case, we only defined image_bytes. Because image_bytes is a byte string, we need to base64-encode the data, which we do by passing JSON objects of the form {"b64": <data>}. If we wanted to send more than one image to the service, we could add a similar entry for each image to the instances array.
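To make the encoding concrete, here is one way you might assemble that request body in Python (the file names are hypothetical):

import base64
import json

def make_instance(path):
    # Read raw image bytes and wrap them in the {"b64": ...} envelope.
    with open(path, "rb") as f:
        return {"image_bytes": {"b64": base64.b64encode(f.read()).decode("utf-8")}}

body = {"instances": [make_instance("image1.jpg"), make_instance("image2.jpg")]}
request_json = json.dumps(body)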
Now, a sample JSON response for this example might look like:
{
  "predictions": [
    {
      "caption": [
        "the",
        "quick",
        "brown",
        "",
        "",
        ""
      ]
    },
    {
      "caption": [
        "A",
        "person",
        "on",
        "the",
        "beach",
        ""
      ]
    }
  ]
}
All responses are JSON objects with an array-valued attribute called "predictions". Each element of the array is the prediction associated with the corresponding input in the instances array from the request. Each entry in the array is a JSON object whose attributes are determined by the keys of the outputs dict that we exported earlier. In this case, we have a single output per input, called caption. Note that the source tensor for caption, caption_out, is actually a matrix, where the number of rows equals the number of instances sent to the service and the number of columns is the constant MAX_WORDS. However, instead of returning the matrix, the service returns each row of the matrix independently, as entries in the predictions array. Since the second dimension is a constant, the model itself presumably pads any unused word slots with the empty string (as depicted above).
One very important note: in the examples above, I showed the raw JSON request/response bodies. From your post, it is apparent you are using Google's generic API client, which parses the response and adds structure around it; specifically, the object you are printing out wraps the predictions within the nested fields [data] and then [modelData:protected].
My personal recommendation is not to use that client for this service and instead to use a generic HTTP request/response library (along with Google's authentication library); but since you have things working, you're welcome to use whatever works for you.
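For reference, a direct call with google-auth and a plain HTTP session might look roughly like this (the project and model names are placeholders, and body is the request dict assembled in the earlier sketch):

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Obtain application-default credentials and an authorized HTTP session.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# Placeholder project and model names.
url = "https://ml.googleapis.com/v1/projects/my-project/models/my_model:predict"
response = session.post(url, json=body)
print(response.json())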