
I'm trying to re-implement Apple's vDSP_zvma function (complex vector multiply-add, D = A*B + C, on split-complex data) using NEON intrinsics, as I'm porting my DSP code to Android:

```c
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

void vDSP_zvma(const DSPSplitComplex *__A, vDSP_Stride __IA, const DSPSplitComplex *__B,
               vDSP_Stride __IB, const DSPSplitComplex *__C, vDSP_Stride __IC,
               const DSPSplitComplex *__D, vDSP_Stride __ID, vDSP_Length __N) {
    // Note: the stride arguments are ignored; unit stride is assumed throughout.
    vDSP_Length n = 0;
#ifdef __ARM_NEON
    // Process 4 complex elements per iteration; the scalar loop below
    // handles the remaining __N % 4 elements.
    vDSP_Length postamble_start = __N & ~3;
    for (; n < postamble_start; n += 4) {
        float32x4_t Ar = vld1q_f32(__A->realp + n);
        float32x4_t Br = vld1q_f32(__B->realp + n);
        float32x4_t Cr = vld1q_f32(__C->realp + n);
        float32x4_t Ai = vld1q_f32(__A->imagp + n);
        float32x4_t Bi = vld1q_f32(__B->imagp + n);
        float32x4_t Ci = vld1q_f32(__C->imagp + n);

        // Dr = Cr + Ar*Br - Ai*Bi
        float32x4_t Dr = vmlaq_f32(Cr, Ar, Br);
        Dr = vmlsq_f32(Dr, Ai, Bi);
        vst1q_f32(__D->realp + n, Dr);

        // Di = Ci + Ar*Bi + Ai*Br
        float32x4_t Di = vmlaq_f32(Ci, Ar, Bi);
        Di = vmlaq_f32(Di, Ai, Br);
        vst1q_f32(__D->imagp + n, Di);
    }
#endif
    for (; n < __N; n++) {
        __D->realp[n] =
                __C->realp[n] + __A->realp[n] * __B->realp[n] - __A->imagp[n] * __B->imagp[n];
        __D->imagp[n] =
                __C->imagp[n] + __A->realp[n] * __B->imagp[n] + __A->imagp[n] * __B->realp[n];
    }
}
```

However, in my tests the speedup is smaller than expected (about 3x over the plain C loop). What might be the reason, and what can be done to fix it?

Update: just to clarify - this code does run much faster than the naive C loop (about 3x), but other functions I ported saw gains closer to 4x (as expected).

Roman
  • Benchmarking with `-O0` is totally useless, especially with intrinsics. Compile with `-O3` if you want the compiler to make good machine code from your intrinsics. – Peter Cordes Oct 30 '19 at 14:58
  • Does this answer your question? [Why does this simple C++ SIMD benchmark run slower when SIMD instructions are used?](https://stackoverflow.com/questions/58365789/why-does-this-simple-c-simd-benchmark-run-slower-when-simd-instructions-are-us) – Peter Cordes Oct 30 '19 at 15:02
  • @PeterCordes with -O3 the compiler optimises the original code as well. That doesn't change much, surprisingly. My compiler flags are usually -O3 -ffast-math. I'll edit the question. – Roman Oct 30 '19 at 15:29
  • "with -O3 the compiler optimises the original code as well." Is that a problem? Do you mean "with -O3, my hand-tuned version is the same speed as the optimizer's version?" (If so, then wouldn't you just let the compiler do the work?) That said, I think you're missing an `#else` here. Do you really mean to run both `for` loops? – Rob Napier Oct 30 '19 at 15:47
  • The second loop is for leftovers, or for all elements when NEON is not used. The reason for my question is that other functions I ported are usually about 4x faster; 3x implies it's still better than the compiler, but stalls somewhere. I'm not familiar with ARM architecture, so I can't tell what the culprit might be. Someone suggested interleaving loads and computation, but that didn't change the execution time as far as I can tell. – Roman Oct 30 '19 at 15:52
  • "about x3 without/with NEON" - what does that mean? That it's 3x as fast compared to the code without SIMD? That's not bad – harold Oct 30 '19 at 17:14
  • @harold perhaps this is the best that can be achieved without writing assembly. It's just that other functions I ported had better results... – Roman Oct 30 '19 at 19:52
  • Can you use multiple vectors to software-pipeline and hide load and FP latency on an in-order pipeline? Or is the compiler doing that for you? It's not a reduction so multiple accumulators to hide FP latency wouldn't be needed with OoO exec. This probably has somewhat low computational intensity and may bottleneck on memory bandwidth, depending on what CPU you test on. – Peter Cordes Oct 31 '19 at 06:34
  • @PeterCordes I'll have to look at the generated assembly. In any case, I was wondering whether I'm doing something inefficient here. But, perhaps, this is what can be done without writing assembly. – Roman Oct 31 '19 at 08:38
  • What CPU are you benchmarking on? What CPUs do you care about performance on? Does that include some in-order CPUs like Cortex-A53 (widespread in budget phones)? If so, make sure you test on them. IDK how much they benefit from software pipelining, or what gcc or clang do by default there. – Peter Cordes Oct 31 '19 at 08:46
  • Cortex-A53 afaik – Roman Oct 31 '19 at 10:24
  • Cortex-A53 is one of the "LITTLE" Arm cores - it has NEON, but it can do less per clock than the NEON you'd get on a "big" core such as Cortex-A72, etc. – solidpixel Nov 30 '19 at 20:12

0 Answers