Cloud ML Engine allows one to deploy nearly any TensorFlow model you are able to export. For any such model, you define your inputs and your outputs and that is what dictates the form of the requests and responses.
I think it might be useful to understand this by walking through an example. You could imagine exporting a model like so:
import tensorflow as tf

def my_model(image_in):
    # Construct an inference graph for predicting captions.
    #
    # image_in is a tensor with shape=(None,) and dtype=tf.string,
    # meaning a batch of raw image bytes.

    # ... do your interesting stuff here ...

    # caption_out is a tensor with shape=(None, MAX_WORDS) and
    # dtype=tf.string; that is, you will be returning a batch of
    # captions, one per input image, one word per column, with
    # padding when the number of words to output is < MAX_WORDS.
    return caption_out

image_in = tf.placeholder(shape=(None,), dtype=tf.string)
caption_out = my_model(image_in)

inputs = {"image_bytes": tf.saved_model.utils.build_tensor_info(image_in)}
outputs = {"caption": tf.saved_model.utils.build_tensor_info(caption_out)}

signature = tf.saved_model.signature_def_utils.build_signature_def(
    inputs=inputs,
    outputs=outputs,
    method_name='tensorflow/serving/predict'
)
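For completeness, writing the SavedModel to disk with that signature might look roughly like the sketch below (this assumes the TF 1.x SavedModelBuilder API; the export path is a placeholder, and in practice you would restore your trained variables rather than initializing them):

export_dir = "gs://my-bucket/my-model/1"  # placeholder export path

with tf.Session() as sess:
    # In a real export you would restore trained weights here instead.
    sess.run(tf.global_variables_initializer())

    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                signature
        }
    )
    builder.save()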
After you export this model (cf. this post), you would construct a JSON request like so:
{
  "instances": [
    {
      "image_bytes": {
        "b64": <base64_encoded_image1>
      }
    },
    {
      "image_bytes": {
        "b64": <base64_encoded_image2>
      }
    }
  ]
}
Let us analyze the request. First, we will be sending the service a batch of images. All requests are JSON objects with an array-valued attribute called "instances"; each entry in the array is one instance to feed the graph for prediction. Note that this is why we are required to set the outermost dimension to None on exported models -- they need to be able to handle variable-sized batches.
Each entry in the array is itself a JSON object whose attributes are the keys of the inputs dict we defined when exporting the model. In this case, we only defined image_bytes. Because image_bytes is a byte string, we need to base64-encode the data, which we do by passing JSON objects of the form {"b64": <data>}. If we wanted to send more than one image to the service, we could add a similar entry for each image to the instances array.
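To make the encoding concrete, here is one way you might assemble that request body in Python (the file names are hypothetical):

import base64
import json

def make_instance(path):
    # Read raw image bytes and wrap them in the {"b64": ...} envelope.
    with open(path, "rb") as f:
        return {"image_bytes": {"b64": base64.b64encode(f.read()).decode("utf-8")}}

body = {"instances": [make_instance("image1.jpg"), make_instance("image2.jpg")]}
request_json = json.dumps(body)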
Now, a sample JSON response for this example might look like:
{
  "predictions": [
    {
      "caption": [
        "the",
        "quick",
        "brown",
        "",
        "",
        ""
      ]
    },
    {
      "caption": [
        "A",
        "person",
        "on",
        "the",
        "beach",
        ""
      ]
    }
  ]
}
All responses are JSON objects with an array-valued attribute called "predictions". Each element of the array is the prediction associated with the corresponding input in the instances array from the request. Each entry in the array is a JSON object whose attributes are determined by the keys of the outputs dict that we exported earlier. In this case, we have a single output per input, called caption. Note that the source tensor for caption, caption_out, is actually a matrix, where the number of rows equals the number of instances sent to the service and the number of columns is the constant MAX_WORDS. However, instead of returning the matrix, the service returns each row of the matrix independently, as entries in the predictions array. Since the second dimension is a constant, the model itself presumably pads any unused word slots with the empty string (as depicted above).
One very important note: in the examples above, I showed the raw JSON request/response bodies. From your post, it is apparent you are using Google's generic API client, which parses the response and adds structure around it; specifically, the object you are printing out wraps the predictions within the nested fields [data] and then [modelData:protected].
My personal recommendation is not to use that client for this service and instead to use a generic HTTP request/response library (along with Google's authentication library); but since you have things working, you're welcome to use whatever works for you.
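For reference, a direct call with google-auth and a plain HTTP session might look roughly like this (the project and model names are placeholders, and body is the request dict assembled in the earlier sketch):

import google.auth
from google.auth.transport.requests import AuthorizedSession

# Obtain application-default credentials and an authorized HTTP session.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# Placeholder project and model names.
url = "https://ml.googleapis.com/v1/projects/my-project/models/my_model:predict"
response = session.post(url, json=body)
print(response.json())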