There are new AVX-512 VNNI instructions in Intel Cascade Lake CPUs which can accelerate inference of neural networks on the CPU. I integrated them into the Simd Library to accelerate Synet (my small framework for neural network inference) and obtained a significant performance boost.
In fact I used only one intrinsic, _mm512_dpbusd_epi32 (the vpdpbusd instruction), which multiplies corresponding unsigned and signed 8-bit integers and accumulates adjacent groups of four products into 32-bit integer accumulators.
It would be great to perform analogous optimizations for NEON (the ARM platform).
So the question is: does any NEON analogue of vpdpbusd exist? And if there is no direct analogue, what is the best way to emulate this instruction?
Below is a scalar reference implementation (to make clear what the function must do):
// Each 32-bit lane of sum accumulates four products of unsigned 8-bit
// inputs and signed 8-bit weights.
inline void pdpbusd(int32x4_t& sum, uint8x16_t input, int8x16_t weight)
{
    uint8_t u[16]; int8_t w[16]; int32_t s[4];
    vst1q_u8(u, input); vst1q_s8(w, weight); vst1q_s32(s, sum);
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            s[i] += int32_t(u[i * 4 + j]) * int32_t(w[i * 4 + j]);
    sum = vld1q_s32(s);
}