
Not sure if this is the right place to ask this kind of question, but I can't really find an example of how int8 inference works at runtime. What I know is that, given that we are performing uniform symmetric quantization, we calibrate the model, i.e. we find the best scale parameter for each weight tensor (channel-wise) and for the activations (which correspond to the outputs of the activation functions, if I understood correctly). After calibration we can quantize the model by applying these scale parameters and clipping the values that fall outside the dynamic range of the given layer. At this point we have a new neural net where all the weights are int8 in the range [-127, 127], plus some scale parameters for the activations.

What I don't understand is how we perform inference on this new network: do we feed the input as float32 or directly as int8? Are all the computations done in int8, or do we sometimes cast from int8 to float32 and vice versa? It would be nice to see a real example for, e.g., a CONV2D+BIAS+ReLU layer. If you could point me to some useful resources, that would be appreciated.
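To make my mental model concrete, here is a rough NumPy sketch of what I *think* happens at runtime for such a layer, treating a 1x1 conv as a plain matmul. The scales `s_x`, `s_w`, `s_y` and all tensor shapes are made-up illustrative values, and the requantization step is done here with float math, whereas real int8 runtimes would use a fixed-point multiplier and shift. Please correct me if this is wrong:

```python
import numpy as np

# Illustrative scales (in practice these come from calibration):
# per-tensor scale for the input activation, per-output-channel scales for the
# weights, per-tensor scale for the output activation.
s_x = 0.05                               # input activation scale
s_w = np.array([0.02, 0.025, 0.03])      # per-output-channel weight scales
s_y = 0.05                               # output activation scale

def quantize(x, scale):
    """float32 -> int8, symmetric: q = clip(round(x / scale), -127, 127)."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    """int8 -> float32: x ~= q * scale."""
    return q.astype(np.float32) * scale

# Pretend this is a 1x1 conv (i.e. a matmul): 4 input channels, 3 output channels.
rng = np.random.default_rng(0)
W_fp32 = rng.normal(size=(3, 4)).astype(np.float32)   # float weights
b_fp32 = rng.normal(size=(3,)).astype(np.float32)     # float bias
x_fp32 = rng.normal(size=(4,)).astype(np.float32)     # float input activation

# Offline: quantize the weights per output channel, and the bias to int32 with
# scale s_x * s_w so it can be added directly to the int32 accumulator.
W_q = np.clip(np.round(W_fp32 / s_w[:, None]), -127, 127).astype(np.int8)
b_q = np.round(b_fp32 / (s_x * s_w)).astype(np.int32)

# Runtime:
x_q = quantize(x_fp32, s_x)                            # float32 -> int8

# int8 x int8 products accumulated in int32, plus the int32 bias.
acc = W_q.astype(np.int32) @ x_q.astype(np.int32) + b_q

# Requantize: the accumulator has scale s_x * s_w; rescale to the output
# activation scale s_y and clip back to int8.
y_q = np.clip(np.round(acc * (s_x * s_w) / s_y), -127, 127).astype(np.int8)

# ReLU can be applied directly on the int8 values (zero-point is 0 here).
y_q = np.maximum(y_q, 0)

# Only dequantize if the next operator needs float; otherwise stay in int8.
y_fp32 = dequantize(y_q, s_y)

print("int8 output:          ", y_q)
print("approx float output:  ", y_fp32)
print("reference float output:", np.maximum(W_fp32 @ x_fp32 + b_fp32, 0))
```

If I understand correctly, the key points would be that the matmul accumulates in int32, the bias is pre-quantized with scale `s_x * s_w` so it adds directly into the accumulator, and the result is requantized back to int8 before the next layer, so nothing has to go back to float32 in between. Is that how real runtimes do it?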

Thanks

  • I know nothing about `int8 inference`, but [Google](https://www.google.com/search?q=int8+inference) was able to find the documentation ([Int8 Inference](https://oneapi-src.github.io/oneDNN/dev_guide_inference_int8.html)) and a nice article which seems to use it: [Achieving FP32 Accuracy for INT8 Inference Using Quantization Aware Training with NVIDIA TensorRT](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/) – Luuk Dec 04 '22 at 10:54
  • I do not think Stackoverflow is meant for sharing resources this way... – Luuk Dec 04 '22 at 10:55
  • Related: https://stackoverflow.com/a/76192467/6561141 (more on the quantization formula) – mrtpk May 07 '23 at 07:42

0 Answers