There are new AVX-512 VNNI instructions in Intel's Cascade Lake CPUs which can accelerate inference of neural networks on the CPU. I integrated them into the Simd Library to accelerate Synet (my small framework for inference of neural networks) and obtained a significant performance boost.

In fact I used only one intrinsic, _mm512_dpbusd_epi32 (vpdpbusd), which multiplies 8-bit unsigned and signed integers and accumulates groups of four adjacent products into 32-bit integer accumulators.
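
For reference, a minimal sketch of how this intrinsic can be applied to an int8 dot product (the function name and loop structure here are just an illustration, not the actual Simd Library code):

#include <immintrin.h> // needs a compiler with AVX-512 VNNI support, e.g. -mavx512vnni
#include <cstddef>
#include <cstdint>

// Hypothetical inner loop: 8-bit unsigned activations times 8-bit signed
// weights, 64 bytes per step, accumulated into 16 x 32-bit partial sums.
inline __m512i DotU8S8(const uint8_t* src, const int8_t* weight, size_t size)
{
    __m512i sum = _mm512_setzero_si512();
    for (size_t i = 0; i < size; i += 64)
    {
        __m512i a = _mm512_loadu_si512(src + i);
        __m512i b = _mm512_loadu_si512(weight + i);
        sum = _mm512_dpbusd_epi32(sum, a, b); // vpdpbusd
    }
    return sum; // 16 lanes of 32-bit partial sums; a horizontal reduction would follow
}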

It would be great to perform analogous optimizations for NEON (the ARM platform).

So the question is:

Is there a NEON instruction analogous to vpdpbusd? If there is no analogue, what is the best way to emulate this instruction?

Below is a scalar implementation (to show exactly what the function must do):

inline void pdpbusd(int32x4_t& sum, uint8x16_t input, int8x16_t weight)
{
    // Each 32-bit lane i accumulates the dot product of four adjacent
    // unsigned input bytes with the corresponding four signed weight bytes.
    for (size_t i = 0; i < 4; ++i)
        for (size_t j = 0; j < 4; ++j)
            sum[i] += int32_t(input[i * 4 + j]) * int32_t(weight[i * 4 + j]);
}
ErmIg
    Ermig, please disclose your affiliation with the synet project. Also, please rephrase the project's description: "inference" is not a verb and I get the impression that you are using it as one. A loosely related recommended read: https://stackoverflow.com/help/promotion – Yunnosch Mar 11 '20 at 10:11

1 Answer


The most straightforward implementation of that requires 3 instructions: vmovl.s8 and vmovl.u8 to extend the signed and unsigned 8-bit values to 16 bit, followed by vmlal.s16 to do a signed lengthening 16-bit multiplication, accumulated into a 32-bit register. And as vmlal.s16 only handles 4 elements, you'd need a second vmlal.s16 to multiply and accumulate the following 4 elements - so 4 instructions for 8 elements.

For aarch64 syntax, the corresponding instructions are sxtl, uxtl and smlal.
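
For illustration, a rough sketch of that per-lane variant with intrinsics could look like this (note that, unlike vpdpbusd, it keeps each product in a fixed 32-bit lane instead of summing groups of 4 adjacent bytes):

#include <arm_neon.h>

// Widen to 16 bit, then do a widening multiply-accumulate into int32 lanes:
// 4 instructions per 8 input bytes, but without the horizontal aggregation
// that vpdpbusd performs.
inline void MlaU8S8(int32x4_t& sum0, int32x4_t& sum1, uint8x8_t input, int8x8_t weight)
{
    int16x8_t i16 = vreinterpretq_s16_u16(vmovl_u8(input)); // vmovl.u8 / uxtl
    int16x8_t w16 = vmovl_s8(weight);                       // vmovl.s8 / sxtl
    sum0 = vmlal_s16(sum0, vget_low_s16(i16), vget_low_s16(w16));   // vmlal.s16 / smlal
    sum1 = vmlal_s16(sum1, vget_high_s16(i16), vget_high_s16(w16)); // vmlal.s16 / smlal
}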

Edit: If the output elements should be aggregated horizontally, one can't use the fused multiply-accumulate instruction vmlal. Then the solution would be vmovl.s8 and vmovl.u8, followed by vmul.i16 (for 8 input elements), vpaddl.s16 (to aggregate two adjacent elements horizontally), followed by another vpadd.i32 to get the sum of 4 elements horizontally. So that's 5 instructions for 8 input elements, or 10 instructions for a full 128-bit vector, followed by one final vadd.s32 to add the result to the accumulator. On AArch64, the equivalent of vpadd.i32, addp, can handle 128-bit vectors, so it's one instruction less there.

If you're using intrinsics, the implementation could look something like this:

int32x4_t vpdpbusd(int32x4_t sum, uint8x16_t input, int8x16_t weight) {
    // Widen the unsigned inputs and signed weights to 16 bit.
    int16x8_t i1 = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(input)));
    int16x8_t i2 = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(input)));
    int16x8_t w1 = vmovl_s8(vget_low_s8(weight));
    int16x8_t w2 = vmovl_s8(vget_high_s8(weight));
    // Multiply element-wise; the products fit in int16 (at most 255 * 127).
    int16x8_t p1 = vmulq_s16(i1, w1);
    int16x8_t p2 = vmulq_s16(i2, w2);
    // Pairwise widening add: sums of two adjacent products, as int32.
    int32x4_t s1 = vpaddlq_s16(p1);
    int32x4_t s2 = vpaddlq_s16(p2);
#if defined(__aarch64__)
    // One more pairwise add yields the sums of four adjacent products.
    int32x4_t s3 = vpaddq_s32(s1, s2);
#else
    int32x4_t s3 = vcombine_s32(
        vpadd_s32(vget_low_s32(s1), vget_high_s32(s1)),
        vpadd_s32(vget_low_s32(s2), vget_high_s32(s2))
    );
#endif
    sum = vaddq_s32(sum, s3);
    return sum;
}
mstorsjo
  • So we can emulate it using 6 instructions? – ErmIg Mar 10 '20 at 13:00
  • No, 4 instructions. For 8 input elements processed from each of the 8-bit signed and unsigned inputs, you need 1 `vmovl.s8`, 1 `vmovl.u8` and 2 `vmlal.s16` to process them. – mstorsjo Mar 10 '20 at 13:26
  • Or 8 instructions for a 128-bit vector. – ErmIg Mar 10 '20 at 13:28
  • Unfortunately your solution does not work properly (see the scalar implementation above). – ErmIg Mar 11 '20 at 07:24
  • Right, so if you want groups of 4 adjacent elements aggregated at the same time, it requires a bit more code, as one can't use the fused multiply-accumulate instructions then; I'll edit the answer accordingly. – mstorsjo Mar 11 '20 at 08:06
  • Thanks. The last version gives the correct result, but its performance is only two times higher than the scalar version. There is no performance boost from using int8 compared with vectorized fp32. – ErmIg Mar 11 '20 at 13:01
  • That's probably understandable: for a single invocation of this, even though the input is int8, it ends up doing multiplication in int16 form and accumulation in int32 form. If multiple iterations of the same operation were merged closer together, there might be some speedup from this form (the int16 multiplications can give increased throughput over something that operates on fp32). – mstorsjo Mar 11 '20 at 13:05
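
To illustrate that last point, here is a hedged sketch of what merging independent iterations could look like, reusing the vpdpbusd emulation above (the function name and loop structure are hypothetical, not taken from the Simd Library):

// Two independent accumulator chains per iteration give the CPU more
// int16 multiplies and int32 adds to overlap, which is where an int8 path
// can start to pull ahead of a vectorized fp32 path.
inline void DotU8S8Neon(const uint8_t* src, const int8_t* weight, size_t size, int32_t* dst)
{
    int32x4_t sum0 = vdupq_n_s32(0);
    int32x4_t sum1 = vdupq_n_s32(0);
    for (size_t i = 0; i < size; i += 32)
    {
        sum0 = vpdpbusd(sum0, vld1q_u8(src + i), vld1q_s8(weight + i));
        sum1 = vpdpbusd(sum1, vld1q_u8(src + i + 16), vld1q_s8(weight + i + 16));
    }
    vst1q_s32(dst, sum0);
    vst1q_s32(dst + 4, sum1);
}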