How to speed up Tensorflow 2 keras model for inference?

Question

So there's a big update nowadays, moving from TensorFlow 1.X to 2.X.

In TF 1.X I got used to a pipeline that helped me push my Keras model to production. The pipeline: keras (h5) model --> freeze & convert to pb --> optimize pb. This workflow sped up inference, and my final model could be stored as a single (pb) file, not a folder (see SavedModel format).

How can I optimize my model for inference in TensorFlow 2.0.0?

My first impression was that I need to convert my tf.keras model to tflite, but since my GPU uses float32 operations, this conversion would make my life even harder.

Thanks.

score 1 · Answer 1 · answered Jan 11 '21 at 14:04

One way to go about it is to optimize your model using Tensorflow with TensorRT (TF-TRT) (https://github.com/tensorflow/tensorrt). However, in Tensorflow 2, models are saved in a folder instead of a single .pb file. This is also the case for TF-TRT optimized models: they are stored in a folder. You can convert your model to TF-TRT as follows:

import tensorflow as tf

# saved_model_dir: path to the SavedModel folder you want to optimize
converter = tf.experimental.tensorrt.Converter(input_saved_model_dir=saved_model_dir)
converter.convert()  # replace supported subgraphs with TensorRT ops
converter.save("trt_optimized_model")  # save it to a dir

If you have a requirement that the model needs to be contained in a single file (and do not care about the optimization offered by TF-TRT), you can convert the SavedModel to ONNX and use ONNX Runtime for inference. You can even go one step further and convert the ONNX file into a TensorRT engine (https://developer.nvidia.com/Tensorrt). This will give you a single optimized file that you can run using TensorRT (note that you cannot run the resulting file with Tensorflow anymore).
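A minimal sketch of the SavedModel to ONNX route, assuming the `tf2onnx` and `onnxruntime` packages are installed. The folder name `model`, the file name `model.onnx`, the opset value and both helper names are placeholders chosen for illustration, not something from your setup.

```python
import sys


def tf2onnx_command(saved_model_dir, onnx_path, opset=13):
    """Build the tf2onnx command line that turns a SavedModel folder into one .onnx file."""
    return [
        sys.executable, "-m", "tf2onnx.convert",
        "--saved-model", saved_model_dir,
        "--output", onnx_path,
        "--opset", str(opset),  # assumed opset; pick one your runtime supports
    ]


def run_onnx(onnx_path, batch):
    """Run one batch (a NumPy array matching the model input) through ONNX Runtime."""
    import onnxruntime as ort  # imported lazily so the helper above works without it

    # Use whatever execution providers this onnxruntime build offers (CPU, CUDA, ...).
    session = ort.InferenceSession(onnx_path, providers=ort.get_available_providers())
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: batch})


# Typical use (not executed here):
#   import subprocess
#   subprocess.run(tf2onnx_command("model", "model.onnx"), check=True)
#   outputs = run_onnx("model.onnx", batch)
```

Whether GPU execution is actually used depends on which onnxruntime package is installed, so it is worth checking `ort.get_available_providers()` on the target machine.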

dragon7 · Answer 2 · 2022-07-28T12:47:44.870

Another way to go about it is to use a different toolkit for inference, e.g. OpenVINO. OpenVINO is optimized specifically for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to Intermediate Representation (IR), pruning the graph and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime.

It's rather straightforward to convert the Tensorflow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets below.

Install OpenVINO

The easiest way to do it is using pip. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[tensorflow2]

Save your model as SavedModel

OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.

import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')

Use Model Optimizer to convert SavedModel model

The Model Optimizer is a command-line tool that ships with the OpenVINO Development Package. It converts the Tensorflow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
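To get a feel for what FP16 does to individual values, here is a small standard-library sketch that rounds a few numbers to half precision and back. The weight values are made up, and this only illustrates per-value rounding; the end-to-end accuracy of a converted model depends on the model and should be measured on your own data.

```python
import struct


def to_fp16_and_back(value):
    """Round a Python float to IEEE 754 half precision and return it as a float."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


# Made-up example weights, not taken from any real model.
weights = [0.1, -0.5, 1.2345, 3.14159, 100.0]

# Relative rounding error introduced by storing each value in FP16.
rel_errors = [abs(to_fp16_and_back(w) - w) / abs(w) for w in weights]
max_rel_error = max(rel_errors)

print(f"max relative rounding error: {max_rel_error:.2e}")
```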

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know which is the best choice for you, just use AUTO.

from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
# (input_image: a preprocessed array whose shape matches the model input)
result = compiled_model_ir([input_image])[output_layer_ir]
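Whichever route you take, it is worth measuring whether the converted model is actually faster on your hardware. Below is a rough timing sketch using only the standard library. The `benchmark` helper and its parameters are hypothetical names for illustration; `predict` is any callable that runs one inference.

```python
import statistics
import time


def benchmark(predict, sample, warmup=5, runs=50):
    """Time predict(sample) and return latency statistics in milliseconds."""
    # Warm-up calls are excluded so one-time setup costs don't skew the numbers.
    for _ in range(warmup):
        predict(sample)

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


# Typical use (not executed here), comparing the original and the converted model:
#   print(benchmark(lambda x: model(x).numpy(), input_image))
#   print(benchmark(lambda x: compiled_model_ir([x]), input_image))
```

Converting the result to a NumPy array inside the timed callable, as in the first usage line, helps make sure any asynchronous GPU work has finished before the clock stops.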

Disclaimer: I work on OpenVINO.
