I am trying to multiply two float arrays (passed as pointers) element by element and store the result in a third array. Here is the plain C++ code:
void cpp_version (float *a, float *b, float *c, int counter, int dim) {
    for (int i=0; i<counter; ++i) {
        for (int j=0; j<dim; ++j) {
            c[j] = a[j] * b[j];
        }
    }
}
I then tried to optimize it with NEON intrinsics:
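#include <arm_neon.h>

void neon_version (float *a, float *b, float *c, int counter, int dim) {
    for (int i=0; i<counter; ++i) {
        // Assumes dim is a multiple of 4; otherwise the last iteration
        // reads and writes past the end of the arrays.
        for (int j=0; j<dim; j+=4) {
            // Load four floats from each input, multiply, store four results.
            float32x4_t _a = vld1q_f32(a+j), _b = vld1q_f32(b+j);
            vst1q_f32(c+j, vmulq_f32(_a, _b));
        }
    }
}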
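The NEON loop above steps through the arrays four floats at a time, so it is only safe when dim is a multiple of 4. A minimal sketch of how the remainder could be handled is below. The function name neon_version_with_tail is hypothetical and not part of my original code, the outer counter loop is left out for brevity, and the `__ARM_NEON` guard is only there so the same file still builds on a non-ARM host.

```cpp
#include <cassert>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical variant of neon_version: vectorizes as many full groups of
// four floats as fit, then finishes the leftover elements with scalar code.
void neon_version_with_tail(const float *a, const float *b, float *c, int dim) {
    assert(dim >= 0);
    int j = 0;
#if defined(__ARM_NEON)
    // Process four floats per iteration while a full vector still fits.
    for (; j + 4 <= dim; j += 4) {
        float32x4_t va = vld1q_f32(a + j);
        float32x4_t vb = vld1q_f32(b + j);
        vst1q_f32(c + j, vmulq_f32(va, vb));
    }
#endif
    // Scalar tail (or the whole array when NEON is not available).
    for (; j < dim; ++j) {
        c[j] = a[j] * b[j];
    }
}
```

With dim = 7, for example, the vector loop would cover elements 0 to 3 and the scalar loop elements 4 to 6, so nothing past the end of the arrays is touched.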
I cross-compile for Android (ARMv8-A) with the NDK's CMake toolchain file:
cmake -DCMAKE_TOOLCHAIN_FILE=$NDK/build/cmake/android.toolchain.cmake \
    -DANDROID_ABI="arm64-v8a" \
    -DANDROID_NDK=$NDK \
    -DANDROID_PLATFORM=android-22 \
    ..
make
The result is:
average time of neon: 0.0098 ms
average time of c++: 0.0067 ms
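For reference, here is a minimal sketch of how an average time per call like the ones above could be measured. It is an assumption about the measurement, not the exact harness behind these numbers: the helper average_ms and the single-pass multiply function are hypothetical names, and it uses std::chrono::steady_clock with one untimed warm-up call.

```cpp
#include <cassert>
#include <chrono>

// Same arithmetic as cpp_version, for a single pass over the arrays.
void multiply(const float *a, const float *b, float *c, int dim) {
    for (int j = 0; j < dim; ++j) c[j] = a[j] * b[j];
}

// Hypothetical helper: average wall-clock time of one call to `kernel`, in ms.
template <typename Kernel>
double average_ms(Kernel kernel, int runs) {
    assert(runs > 0);
    kernel();  // warm-up call, not timed
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < runs; ++r) kernel();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / runs;  // total milliseconds divided by call count
}
```

It would be called as `average_ms([&] { multiply(a, b, c, dim); }, 1000)`, and the same call with the NEON function gives the second number. Reading c after the timed loop gives an optimizing compiler less freedom to discard the work being measured.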
Why is the NEON version roughly 1.5 times slower than the plain C++ version?