Why is inference using tf.keras 75x slower than using TFLite?

Question

I run code that makes predictions on audio data using a simple CNN.

When using tf.keras.Model.predict I get an average execution time of 0.17s, and when I use tf.lite.Interpreter I get 0.002s, about 75x faster! I tried on my desktop (Ubuntu 18.04, TF 2.1) and on a Raspberry Pi 3B+ (Raspbian Buster, same code) and got about the same difference.

Why is the difference that big?

UPDATE: I set batch_size=1 in tf.keras.Model.predict and it is now 65x slower than TFLite.

test_tflite.py

import os
import pathlib
import tensorflow as tf
from tensorflow.keras.models import model_from_json
import numpy as np
import time


# disable GPU
tf.config.set_visible_devices([], 'GPU')


parent = pathlib.Path(__file__).parent.absolute()

# path to Tensorflow model and weights
MODEL_PATH = os.path.join(parent, 'models/vd_model.json')
WEIGHTS_PATH = os.path.join(parent, 'models/model.30-0.97.h5')
INPUT_SHAPE = (1, 43, 40, 1)

NUM_RUN = 100


def predict_tflite(interpreter, input_details, output_details, data):
    interpreter.set_tensor(input_details[0]['index'], data)
    interpreter.invoke()
    output_data = interpreter.get_tensor(output_details[0]['index'])
    return output_data


def run():

    # Load Tensorflow model
    with open(MODEL_PATH, 'r') as f:
        model = model_from_json(f.read())
    model.load_weights(WEIGHTS_PATH)

    # Show model
    model.summary()

    # Convert to TFLite
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()
    interpreter = tf.lite.Interpreter(model_content=tflite_model)
    interpreter.allocate_tensors() 
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    predictions = []
    for i in range(NUM_RUN):

        # fake input data
        data = np.random.rand(*INPUT_SHAPE).astype(np.float32)

        # Tensorflow
        start_time = time.time()
        prediction = model.predict(data, batch_size=1)
        elapsed = time.time() - start_time

        # TensorFlow Lite
        start_time = time.time()
        prediction_tflite = predict_tflite(interpreter, input_details, output_details, data)
        elapsed_tflite = time.time() - start_time

        predictions.append(((elapsed, prediction), (elapsed_tflite, prediction_tflite)))

    # Make sure predictions are close
    for pred_tf, pred_tflite in predictions:
        if not np.allclose(pred_tf[1], pred_tflite[1]):
            print('Predictions are not close')

    # Compute average execution times
    tf_avg = np.mean([p[0] for p, _ in predictions])
    tflite_avg = np.mean([p[0] for _, p in predictions])

    print(f'TF average prediction time: {tf_avg:.6f}s')
    print(f'TFLite average prediction time: {tflite_avg:.6f}s')


if __name__ == "__main__":
    run()

Execution (Raspberry Pi):

pi@raspberrypi:~/src/audio_monitoring/audio_monitoring/tests $ python3 test_tflite.py 
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 43, 40, 16)        160       
_________________________________________________________________
batch_normalization (BatchNo (None, 43, 40, 16)        64        
_________________________________________________________________
activation (Activation)      (None, 43, 40, 16)        0         
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 22, 20, 16)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 22, 20, 32)        4640      
_________________________________________________________________
batch_normalization_1 (Batch (None, 22, 20, 32)        128       
_________________________________________________________________
activation_1 (Activation)    (None, 22, 20, 32)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 1, 1, 32)          0         
_________________________________________________________________
dropout (Dropout)            (None, 1, 1, 32)          0         
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 4)                 132       
=================================================================
Total params: 5,124
Trainable params: 5,028
Non-trainable params: 96
_________________________________________________________________

TF average prediction time: 0.168310s
TFLite average prediction time: 0.002269s

`model(x)` might be faster and stabler; see [here](https://stackoverflow.com/questions/60267911/keras-inconsistent-prediction-time/60307121#60307121) — OverLordGoldDragon, May 25 '20 at 13:51

score 3 · Accepted Answer · answered May 25 '20 at 13:36

There can be many reasons behind this performance difference, but to summarize:

- At TFLite model conversion time, graph optimizations are applied (constant folding, op fusing, etc.).
- The static execution plan is also determined ahead of time, at conversion time.
- Even on CPU, TFLite often provides kernel implementations optimized for a specific CPU architecture (e.g., NEON on ARM).
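
To make "op fusing" concrete, here is a minimal sketch of one common fusion: folding an inference-mode batch normalization into the layer before it, so two ops become one. It uses a tiny dense layer in plain Python rather than a convolution to stay dependency-free. The function names and toy numbers are made up for illustration, and whether the converter applies exactly this rewrite to the model in the question is an assumption on my part.

```python
import math


def dense(x, weights, bias):
    """y[j] = sum_i x[i] * weights[i][j] + bias[j]"""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def batch_norm(y, gamma, beta, mean, var, eps=1e-3):
    """Inference-mode batch normalization, applied per output channel."""
    return [
        g * (v - m) / math.sqrt(s + eps) + b
        for v, g, b, m, s in zip(y, gamma, beta, mean, var)
    ]


def fold_batch_norm(weights, bias, gamma, beta, mean, var, eps=1e-3):
    """Return (weights, bias) of one dense layer equal to dense + batch norm."""
    # Per-channel factor that batch norm multiplies the pre-activation by.
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    folded_w = [[w * scale[j] for j, w in enumerate(row)] for row in weights]
    folded_b = [(b - m) * sc + be for b, m, sc, be in zip(bias, mean, scale, beta)]
    return folded_w, folded_b


# Toy values, made up for illustration: 3 inputs, 2 output channels.
x = [0.5, -1.0, 2.0]
W = [[0.1, 0.2], [0.3, -0.4], [-0.5, 0.6]]
b = [0.05, -0.05]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.2, -0.1], [0.9, 1.1]

two_ops = batch_norm(dense(x, W, b), gamma, beta, mean, var)  # unfused
W_f, b_f = fold_batch_norm(W, b, gamma, beta, mean, var)
one_op = dense(x, W_f, b_f)  # fused: same result, one op at run time

assert all(math.isclose(a, c, rel_tol=1e-9, abs_tol=1e-12)
           for a, c in zip(two_ops, one_op))
```

The folding itself happens once, ahead of time, so at inference only the single fused layer runs. The model in the question has a BatchNormalization after each Conv2D, which is the pattern this kind of fusion targets.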

That being said, not all TensorFlow models can be converted to TFLite, since TFLite supports only a subset of the ops that TensorFlow supports.
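
Part of the measured gap may also be per-call overhead in model.predict itself: the Keras documentation describes predict as designed for large batched inputs and suggests calling the model directly, e.g. model(x, training=False), for small inputs that fit in one batch. To compare the options fairly, it helps to exclude one-time setup cost with a few warm-up calls. Below is a minimal timing helper as a sketch; the name benchmark and its parameters are my own, not part of any library.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=100):
    """Time fn() after a few untimed warm-up calls; return seconds per call."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time costs such as graph tracing
    samples = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "median": statistics.median(samples),
        "min": min(samples),
    }


# Hypothetical usage with the objects from test_tflite.py (not run here):
#   benchmark(lambda: model.predict(data, batch_size=1))
#   benchmark(lambda: model(data, training=False))
#   benchmark(lambda: predict_tflite(interpreter, input_details,
#                                    output_details, data))
```

Reporting the median alongside the mean makes it easier to see whether a few slow outlier calls are skewing the average.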

I think you'd find this tech talk interesting. Please have a look.

https://www.youtube.com/watch?v=gHN0jDbJz8E
