tflite quantized inference very slow

Question

I am trying to convert a trained model from checkpoint file to tflite. I am using tf.lite.LiteConverter. The float conversion went fine with reasonable inference speed. But the inference speed of the INT8 conversion is very slow. I tried to debug by feeding in a very small network. I found that inference speed for INT8 model is generally slower than float model.

In the INT8 tflite file, I found some tensors called ReadVariableOp, which doesn't exist in TensorFlow's official mobilenet tflite model.

I wonder what causes the slowness of INT8 inference.

score 2 · Answer 1 · answered Nov 27 '20 at 06:36

2

You possibly used x86 cpu instead of one with arm instructions. You can refer it here https://github.com/tensorflow/tensorflow/issues/21698#issuecomment-414764709

answered Nov 27 '20 at 06:36

Charlie Qiu

31
5

Mike B · Answer 2 · 2023-03-30T07:02:03.480

There can be many reasons for this, some of the most common:

Lack of INT8 instruction set architecture (ISA)

You won't for example see INT8 model boosts compared to Float32 on Intel CPUs under 10th gen. This is because Intel CPUs < 10th gen don't have Intel DLBoost, a specific instruction set (ISA) architecture designed to improve performance of INT8 DL models. This ISA is present in Intel chips from 10th gen onwards. Most certainly, without a specific INT8 ISA the operations get upsampled to Float32.
Float32 operations transform into verbose INT8 operations

Some Float32 operations are not INT8 friendly, which leads to a very verbose INT8 counterpart. Potentially much slower that the original operation

tflite quantized inference very slow

2 Answers2

Linked