
I am trying to multiply the data in two float arrays element-wise and store the result in a third array. Here is the C++ code:

void cpp_version (float *a, float *b, float *c, int counter, int dim) {
    for (int i=0; i<counter; ++i) {
        for (int j=0; j<dim; ++j) {
            c[j] = a[j] * b[j];
        }
    }
}

I optimized it with NEON intrinsics:

#include <arm_neon.h>

void neon_version (float *a, float *b, float *c, int counter, int dim) {
    for (int i=0; i<counter; ++i) {
        for (int j=0; j<dim; j+=4) {
            float32x4_t _a = vld1q_f32(a+j), _b = vld1q_f32(b+j);
            vst1q_f32(c+j, vmulq_f32(_a, _b));
        }
    }
}

I cross-compile it for Android deployment (Armv8-A) with the NDK CMake toolchain:

cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
-DANDROID_ABI="arm64-v8a" \
-DANDROID_NDK=$NDK \
-DANDROID_PLATFORM=android-22 \
..

make

The result is:

average time of neon: 0.0098 ms
average time of c++: 0.0067 ms

Why is NEON much slower than plain C++?

Jersey
  • That is a very short test, did you run it multiple times and take averages? But mostly I suspect you're running both tests sequentially, and using the same data, which means the plain C++ has the advantage of cache. – Borgleader Jul 05 '22 at 03:47
  • @Borgleader I ran both tests with counter=20, sequentially, using the same data. – Jersey Jul 05 '22 at 03:50
  • Right, so if you invert the order of the tests, do the timings stay the same? – Borgleader Jul 05 '22 at 03:52
  • @Borgleader Still the same. – Jersey Jul 05 '22 at 04:07
  • 1
    In a lot of cases the compiler will just beat you at optimizing for specific hardware. It will take into account things like pipelining, branch prediction etc. to keep the CPU as busy as it can be. This might be interesting to watch : [what has my compiler done for me lately](https://www.youtube.com/watch?v=bSkpMdDe4g4). You can also have a look at compiler explorer and see what your original code compiled too (you might be surprised to see vector instructions there too). Example (for x86) here : https://godbolt.org/z/fWfx8Mz4x – Pepijn Kramer Jul 05 '22 at 04:21
  • Is your data correctly aligned? If not, then your loads and stores will be inefficient – Alan Birtles Jul 05 '22 at 06:02
  • @AlanBirtles What do you mean by correctly aligned? The data are allocated by `float *a = (float*)malloc(size * sizeof(float));` so they are stored contiguously. – Jersey Jul 05 '22 at 06:09
  • `malloc` doesn't offer any alignment guarantees, you should use an aligned allocator (see the aligned-allocation sketch after these comments). See https://stackoverflow.com/questions/45714535/performance-of-unaligned-simd-load-store-on-aarch64 for what alignment means – Alan Birtles Jul 05 '22 at 06:42
  • 1
    @AlanBirtles `malloc` is guaranteed to return memory aligned for any primitive type. It's often more aligned. Should be 8 or 16 byte aligned on ARMv8. – Goswin von Brederlow Jul 05 '22 at 08:57
  • @GoswinvonBrederlow SIMD operations are often more performant with more alignment than that, e.g. the answer I linked to suggests 64 bytes might improve performance – Alan Birtles Jul 05 '22 at 09:21
  • @AlanBirtles That might be true. I was just debunking your claim that `malloc` has no alignment. – Goswin von Brederlow Jul 05 '22 at 09:27
  • I'd have counter more like 2000 than 20. Also, the compiler will likely auto-vectorize into NEON, and may work out some unrolling, so it does a better job of optimizing than your initial implementation. Even more so on a recent Clang, but some NEON optimizations have been in there a while. You can turn off auto-vectorization with a flag if you wish to compare. – BenClark Jul 08 '22 at 10:58
  • That's a pretty bad benchmark: the execution time is dominated by memory loads/stores, where NEON isn't any faster. And NEON instructions come with longer latencies. You should unroll the inner loop to at least 16 floats per iteration for a more meaningful comparison. – Jake 'Alquimista' LEE Jul 09 '22 at 19:20
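
As a minimal sketch of the aligned-allocation suggestion in the comments above (the helper name is made up, and the 64-byte alignment is only a guess taken from the linked answer, so measure before relying on it):

    #include <stdlib.h>

    // Allocate n floats with 64-byte alignment via posix_memalign, which bionic
    // provides at every Android API level. Returns nullptr on failure.
    // Free the result with free().
    float *alloc_aligned_floats(size_t n) {
        void *p = nullptr;
        if (posix_memalign(&p, 64, n * sizeof(float)) != 0) {
            return nullptr;
        }
        return static_cast<float *>(p);
    }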

1 Answer


Looking at the GCC and Clang output: https://godbolt.org/z/5csTEjf5o

GCC seems to emit a simple scalar fmul loop, failing to vectorize this even with `__restrict__` added.

Clang unrolls the loop and vectorizes it into blocks of 8 floats:

    fmul    v0.4s, v0.4s, v2.4s
    fmul    v1.4s, v1.4s, v3.4s

Lacking the right typedefs for your NEON code, I can't see what that turns into. But you only process 4 floats at a time; doing 8 at a time might be faster.
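
For reference, a sketch of what 8 floats per inner iteration could look like with the same intrinsics (the function name is made up, and it assumes dim is a multiple of 8):

    #include <arm_neon.h>

    // Hypothetical 8-wide variant of the question's neon_version, mirroring
    // clang's unrolled output. Assumes dim % 8 == 0.
    void neon_version_x8(const float *a, const float *b, float *c,
                         int counter, int dim) {
        for (int i = 0; i < counter; ++i) {
            for (int j = 0; j < dim; j += 8) {
                float32x4_t a0 = vld1q_f32(a + j), a1 = vld1q_f32(a + j + 4);
                float32x4_t b0 = vld1q_f32(b + j), b1 = vld1q_f32(b + j + 4);
                vst1q_f32(c + j,     vmulq_f32(a0, b0));
                vst1q_f32(c + j + 4, vmulq_f32(a1, b1));
            }
        }
    }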

You should really define a properly aligned structure for your vector of floats, ideally with a compile-time size. The compiler can optimize this a lot better if it knows it is, e.g., 16 floats aligned to the SIMD registers.
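
For example, a minimal sketch of such a structure (the type name and the choice of 16 floats are made up for illustration):

    // Hypothetical fixed-size vector: 16 floats, 16-byte aligned so it maps
    // cleanly onto four 128-bit NEON registers.
    struct alignas(16) Vec16 {
        float v[16];
    };

    // With the size and alignment known at compile time, the compiler is free
    // to fully unroll and vectorize this loop without runtime checks.
    void mul(const Vec16 &a, const Vec16 &b, Vec16 &c) {
        for (int j = 0; j < 16; ++j) {
            c.v[j] = a.v[j] * b.v[j];
        }
    }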

Goswin von Brederlow