
I'm working on quantization for a transfer-learning model based on MobileNetV2, trained on a personal dataset. There are 2 approaches that I have tried:

i.) Post-training quantization only: it works fine and gives an average inference time of 0.04s for 60 images at 224x224.
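
For reference, post-training quantization of this kind is typically done with the TFLite converter, roughly as in the sketch below (not my exact script; model and representative_dataset are placeholder names for the trained Keras MobileNetV2 and a calibration data generator):

    import tensorflow as tf

    # 'model' is a placeholder for the trained Keras MobileNetV2 transfer-learning model
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # optional: a representative dataset lets the converter calibrate activation ranges
    converter.representative_dataset = representative_dataset
    tflite_model = converter.convert()
    open('ptq_model.tflite', 'wb').write(tflite_model)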

ii.) Quantization aware training + post-training quantization: it gives better accuracy than post-training quantization alone, but a higher inference time of 0.55s for the same 60 images.
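
The quantization aware training path follows the tensorflow_model_optimization flow, roughly like this sketch (again not my exact script; model, train_images, train_labels and the compile settings are placeholders):

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # wrap the trained float model with fake-quantization nodes and fine-tune briefly
    # ('model', 'train_images', 'train_labels' are placeholders)
    q_aware_model = tfmot.quantization.keras.quantize_model(model)
    q_aware_model.compile(optimizer='adam',
                          loss='categorical_crossentropy',
                          metrics=['accuracy'])
    q_aware_model.fit(train_images, train_labels, epochs=1)

    # the conversion then produces the quantized .tflite
    converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    quantized_tflite_model = converter.convert()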

1.) The post-training-quantization-only model (.tflite) is run for inference with:

    import cv2
    import numpy as np
    from tensorflow.keras.applications.mobilenet_v2 import preprocess_input  # assuming MobileNetV2's preprocess_input

    # load the image, convert BGR -> RGB and resize to the 224x224 model input
    images_ = cv2.resize(cv2.cvtColor(cv2.imread(imagepath), cv2.COLOR_BGR2RGB), (224, 224))
    images = preprocess_input(images_)
    x = np.expand_dims(images, axis=0)
    interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
    interpreter.invoke()
    classes = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])

2.) The quantization aware training + post-training quantization model is run with the code below. The difference is that here it asks for float32 input.

    # same preprocessing as above; the only difference is the explicit cast to float32,
    # which this model's input tensor expects
    images_ = cv2.resize(cv2.cvtColor(cv2.imread(imagepath), cv2.COLOR_BGR2RGB), (224, 224))
    images = preprocess_input(images_)
    x = np.expand_dims(images, axis=0).astype(np.float32)
    interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
    interpreter.invoke()
    classes = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
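
To check what each interpreter actually expects and produces, the input/output details can be printed like this (sketch; interpreter is whichever of the two .tflite models is loaded):

    # print the input/output specs of whichever .tflite model is currently loaded
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    print('input :', inp['dtype'], inp['quantization'])   # dtype plus (scale, zero_point)
    print('output:', out['dtype'], out['quantization'])

The printed dtype and (scale, zero_point) values show how the two models differ in what they expect at the input.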

I have searched a lot but didn't find any answer to this query. If possible, please help me understand why the inference time is higher with quantization aware training + post-training quantization than with post-training quantization only.

Aparajit Garg

2 Answers


I don't think you should do quantization aware training + post-training quantization together.

According to https://www.tensorflow.org/model_optimization/guide/quantization/training_example, if you use quantization aware training, the conversion will give you a model with int8 weights, so there is no point in doing post-training quantization here.
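
One way to check whether the converted model really carries int8 weights is to look at the tensor types inside the .tflite file, for example with a sketch like this (the model path is a placeholder):

    import numpy as np
    import tensorflow as tf

    # load the converted model and list which tensor dtypes it contains
    interpreter = tf.lite.Interpreter(model_path='qat_model.tflite')  # placeholder path
    interpreter.allocate_tensors()
    dtypes = [t['dtype'] for t in interpreter.get_tensor_details()]
    print(sorted({np.dtype(d).name for d in dtypes}))  # expect mostly int8 if the weights were quantized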

Thaink
  • Hi, in the article it is clearly mentioned that : "Note that the resulting model is quantization aware but not quantized (e.g. the weights are float32 instead of int8). The sections after show how to create a quantized model from the quantization aware one." And to make the weights int8 we need to do post training quantization. Correct me if I am wrong. – Aparajit Garg Sep 10 '20 at 05:42
  • Aparajit, you are right, you have to do the post-training quantize step. QAT just trains the model to operate well in the quantization range; the post-training quantize step is how you get the tflite model. – dtlam26 Sep 10 '20 at 06:39
  • In more detail: you can run QAT in tf2.x or tf1.x so that your model operates well in that quantization range, and then export it with post-training quantization with no harm. This is the step TensorFlow uses as post-training quantization after QAT: https://www.tensorflow.org/model_optimization/guide/quantization/training_comprehensive_guide?hl=en#create_and_deploy_quantized_model – dtlam26 Sep 10 '20 at 06:43
  • I think you misread the question. The problem I'm facing is that the inference time is higher in quantization aware training. If required please go through the question again and let me know if you have any doubts. – Aparajit Garg Sep 14 '20 at 06:55

I think the part that converts from uint8 to float32 (.astype(np.float32)) is what makes it slower. Otherwise, they should be at the same speed.
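
One way to test that is to time the cast and the invoke separately, for example with a sketch like this (images and interpreter are the preprocessed array and the QAT model from the question):

    import time
    import numpy as np

    t0 = time.perf_counter()
    x = np.expand_dims(images, axis=0).astype(np.float32)   # the suspected cast
    t1 = time.perf_counter()
    interpreter.set_tensor(interpreter.get_input_details()[0]['index'], x)
    interpreter.invoke()
    t2 = time.perf_counter()
    print('cast  :', t1 - t0, 's')
    print('invoke:', t2 - t1, 's')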

Louis Yang