
I built a simple CNN model with one convolutional layer and converted it with TensorFlow Lite (for MNIST!). So now my model takes 8-bit integer inputs and its weights are 8-bit integers too.

I wanted to test the parameters I got from TFLite, so I wrote C code for the inference step.

Input image pixels were 8-bit integers between 0 and 255, and the weights were 8-bit integers between -128 and 127. (Biases were 32-bit integers.) The convolution results, of course, contained numbers bigger than 255.

I checked this paper (https://arxiv.org/pdf/1712.05877.pdf, "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"), and it has some guidance on what to do with this convolution result. It says I have to (1) scale it down, (2) cast it down (to uint8), and (3) apply the activation function to generate an 8-bit output.

To my understanding, I needed to multiply the convolution results by 2^(-n). So I divided the convolution outputs by 256, clamped the maximum value to 255, and then carried on with the fully connected layer's weights.

It showed a good result (accuracy 0.96+), but not as high as the TFLite evaluation reported (accuracy 0.98+).

I don't think I did it the right way, because the 256 I divided the convolution outputs by was an arbitrary number. In fact, when I changed it to 340 it gave the best result, but that was still well below the TFLite evaluation with the TFLite Interpreter.

What is the correct and sophisticated way to implement the inference step? How do I scale down?

C. Kim
  • Related: https://stackoverflow.com/a/76192467/6561141 (more on the quantization formula) – mrtpk May 07 '23 at 07:42

1 Answer


This is a great question about the fundamentals of quantization in TF Lite. The paper you mention is a good reference and a guide for understanding the underlying math. TF Lite now uses a slightly different quantization scheme from the one in that paper, but it still fully supports the older scheme for models that were converted before the current scheme was implemented. For reference, you can see the details of the current quantization scheme here:

https://www.tensorflow.org/lite/performance/quantization_spec

The answer to your question applies equally well to all quantization schemes in TF Lite. To get to the particulars of your question: you want to understand how to go from the 32-bit accumulator (the result of adding up all of the products of activation * filter) down to the quantized value (either uint8 or int8). From the paper, you can see that the matrix multiplication (the arithmetic is similar for the convolution case you are interested in) is done entirely with integer operations, except for the real-valued multiplier M defined in Equation 5 in Section 2.2. The goal of the quantization scheme is to perform all of the math in integer-only arithmetic, so the challenge becomes: how do I multiply by the real-valued M with only integer operations?
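
Concretely, here is a minimal C sketch (my own illustrative names, not TF Lite code) of how one 32-bit accumulator value is produced with integer-only arithmetic before any downscaling happens:

    #include <stdint.h>

    /* Accumulate one output value of a quantized fully connected layer; a
     * convolution accumulates the same way over its receptive field.
     * Everything here is 32-bit integer arithmetic. */
    int32_t quantized_dot_product(const uint8_t *input,      /* quantized activations */
                                  const int8_t *filter,      /* quantized weights     */
                                  int depth,
                                  int32_t input_zero_point,
                                  int32_t filter_zero_point, /* often 0 for weights   */
                                  int32_t bias)              /* biases stay 32-bit    */
    {
        int32_t acc = bias;
        for (int j = 0; j < depth; ++j) {
            /* Subtract the zero points so that each quantized value stands
             * for real_value = scale * (quantized_value - zero_point). */
            acc += ((int32_t)input[j] - input_zero_point) *
                   ((int32_t)filter[j] - filter_zero_point);
        }
        /* The only non-integer quantity left is the multiplier M that maps
         * acc onto the output's quantization; applying it is the step
         * discussed below. */
        return acc;
    }

Everything up to this point is exact integer math; the only real-valued quantity left is M, which per Equation 5 is just the ratio of scales S1 * S2 / S3.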

The "trick" is to represent M as is done in equation 6, as the product of 2 raised to some negative exponent times M_0, which is a real number of at least 0.5 and bound from above by 1. At first glance, that does not appear to make our problem any easier. However, first consider the 2^(-n) part. This can be represented on a computer as a just a bit shift (I'll talk about rounding in a second). Assuming any rounding issues are handled, that part is easy to do with only integer arithmetic. Now for the M_0 part. By construction, we have bound M_0 to a range where we can used a fixed point representation with an integer type (e.g. int32) and use all the bits as fractional bits (if you are not familiar with fixed point representation you might need to refer to outside sources of information).

We call the 32-bit fixed-point representation of M_0 the "quantized multiplier". You can see the particulars of the operation in the link below, but, essentially, multiplying the accumulator by the quantized multiplier involves a standard integer multiplication resulting in a 64-bit number, and then taking the high 32 bits of that result.
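
In C, that fixed-point multiply looks roughly like this (a sketch of the idea behind gemmlowp's SaturatingRoundingDoublingHighMul, with illustrative names):

    #include <stdint.h>

    /* Multiply the 32-bit accumulator by the Q0.31 quantized multiplier in
     * 64 bits, round, and keep the high 32 bits of the doubled product.
     * Doubling compensates for M_0 having 31 (not 32) fractional bits, so
     * the result is approximately round(acc * M_0). */
    int32_t saturating_rounding_doubling_high_mul(int32_t a, int32_t b)
    {
        /* The only case that can overflow is INT32_MIN * INT32_MIN. */
        if (a == INT32_MIN && b == INT32_MIN) {
            return INT32_MAX;
        }
        const int64_t ab = (int64_t)a * (int64_t)b;
        /* Round to nearest by nudging before discarding the low 31 bits. */
        const int64_t nudge = ab >= 0 ? (1ll << 30) : (1 - (1ll << 30));
        return (int32_t)((ab + nudge) / (1ll << 31));
    }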

The actual code is a bit difficult to go through because there are various issues to handle with proper rounding (as discussed in the paper), overflows, value saturation, clamping, etc. You can get started on understanding it though by looking at the reference implementation here:

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/kernels/internal/common.h#L153-L162

where SaturatingRoundingDoublingHighMul implements the fixed point multiplication by the quantized multiplier and RoundingDivideByPOT implements the multiplication by 2^(-n).
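
Putting it together, a rough sketch of the whole downscaling step (again, illustrative names, not the actual TF Lite code) is: multiply by the quantized multiplier as above, apply the 2^(-n) part as a rounding right shift, add the output zero point, and clamp to the 8-bit range:

    #include <stdint.h>

    /* Fixed-point multiply by M_0, from the sketch above. */
    int32_t saturating_rounding_doubling_high_mul(int32_t a, int32_t b);

    /* Rounding arithmetic right shift by n bits (n >= 0), i.e. the
     * multiplication by 2^(-n). Assumes arithmetic shift of negative
     * values, as on typical two's-complement platforms. */
    int32_t rounding_divide_by_pot(int32_t x, int n)
    {
        const int32_t mask = (int32_t)(((int64_t)1 << n) - 1);
        const int32_t remainder = x & mask;
        const int32_t threshold = (mask >> 1) + (x < 0 ? 1 : 0);
        return (x >> n) + (remainder > threshold ? 1 : 0);
    }

    /* Downscale one 32-bit accumulator value to uint8: scale, shift, add
     * the output zero point, then clamp to [0, 255]. In TF Lite the clamp
     * bounds also encode any fused activation such as ReLU. */
    uint8_t downscale_to_uint8(int32_t acc,
                               int32_t quantized_multiplier,
                               int right_shift,
                               int32_t output_zero_point)
    {
        int32_t x = saturating_rounding_doubling_high_mul(acc, quantized_multiplier);
        x = rounding_divide_by_pot(x, right_shift);
        x += output_zero_point;
        if (x < 0)   x = 0;
        if (x > 255) x = 255;
        return (uint8_t)x;
    }

This multiplier/shift pair, derived from the actual scales stored in your model, is essentially what replaces the fixed division by 256 (or 340) in your experiment.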

In actual code run on devices, TF Lite uses various kinds of optimized instructions to implement this arithmetic, but the reference code gets the same answer and is easier to inspect and understand. Hope that helps!

  • Thank you so much for your detailed explanation. I really appreciate it. I pondered over your answer for several hours, but I still don't fully understand. Using all the bits of M_0 as fractional bits means you're multiplying by 2^32 (in the case of int32), doesn't it? Say the result of "activation * filter" is F (32-bit); the downscaled value should be (F * (M_0<<32)) >> 32. But that's still 32 bits. Can you explain a little more about how I can get 8-bit integers from these 32-bit results? – C. Kim Jun 25 '20 at 11:37
  • I don't know what you mean by "Using all the bits of M_0 as fractional bits means you're multiplying by 2^32 [..] doesn't it?" I would spend a bit of time working through exercises on multiplying two fixed-point numbers. For example, a quick search shows this as a nice example: https://www.allaboutcircuits.com/technical-articles/multiplication-examples-using-the-fixed-point-representation/ – T.J. Alumbaugh Jun 25 '20 at 18:39
  • As you work through some basic examples, you can then ask the question, "what if one operand is all integer bits, no fractional bits, and the other is all fractional bits and no integer bits?" That is essentially what we have in this case. Once you understand that part, then the RoundingDivideByPOT is pretty simple, since it just handles rounding and then clamping to within the allowed 8-bit range. – T.J. Alumbaugh Jun 25 '20 at 18:39
  • @T.J.Alumbaugh I am working on this same thing, implementing in FPGA, and I'm uncertain of how to handle per-axis quantization. My example is a convnet that performs a quantized convolution, and the output has different quantization for each of its 12 channels. Then the result is flattened and fed to fully connected. Does the fully connected take into account the fact that the input has different quantization per channel? – A Tyshka Mar 18 '21 at 18:31