Starting with Cascade Lake, Intel CPUs provide AVX-512 VNNI instructions, which can accelerate inference of quantized neural networks on the CPU.
In particular, there is an intrinsic _mm512_dpbusd_epi32 (instruction vpdpbusd) which multiplies unsigned 8-bit integers by signed 8-bit integers and accumulates the products into 32-bit integer accumulators.
Pseudocode for this instruction is shown below:
void _mm512_dpbusd_epi32(int32_t sum[16], uint8_t a[16][4], int8_t b[16][4])
{
    // For each 32-bit lane: multiply four unsigned 8-bit values by four
    // signed 8-bit values and add all four products to the lane's accumulator.
    for (int i = 0; i < 16; ++i)
        sum[i] +=
            (int)a[i][0] * b[i][0] + (int)a[i][1] * b[i][1] +
            (int)a[i][2] * b[i][2] + (int)a[i][3] * b[i][3];
}
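Note that the actual intrinsic operates on __m512i registers rather than arrays; assuming a holds 64 uint8_t values and b holds 64 int8_t values, a call looks like this (the wrapper name is illustrative):

#include <immintrin.h>

// Requires AVX-512 VNNI: accumulates 4 u8*s8 products per 32-bit lane.
static inline __m512i dot_step(__m512i sum, __m512i a, __m512i b)
{
    return _mm512_dpbusd_epi32(sum, a, b);
}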
Unfortunately, Intel CPUs before Cascade Lake do not have this instruction, so the question arises of how to emulate it using earlier extensions (for example AVX-512BW). So my question is: how can this emulation be made as efficient as possible?
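For reference, the most direct emulation I am aware of uses only AVX-512BW: _mm512_maddubs_epi16 (vpmaddubsw) followed by _mm512_madd_epi16 (vpmaddwd) against a vector of ones, then a 32-bit add. A minimal sketch (the function name is mine):

#include <immintrin.h>

// AVX-512BW emulation sketch of vpdpbusd.
// Caveat: not bit-exact -- vpmaddubsw saturates the intermediate 16-bit
// pair sums, so e.g. 255*127 + 255*127 clamps to 32767 before widening.
static inline __m512i dpbusd_epi32_bw(__m512i sum, __m512i a, __m512i b)
{
    __m512i prod16 = _mm512_maddubs_epi16(a, b);                      // u8*s8 -> saturated s16 pair sums
    __m512i prod32 = _mm512_madd_epi16(prod16, _mm512_set1_epi16(1)); // widen s16 pairs to s32
    return _mm512_add_epi32(sum, prod32);
}

This costs three instructions per accumulation instead of one and has the saturation caveat above, so I am interested in whether it can be done faster or more exactly.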