I'm trying to re-implement Apple's vDSP_zvma function using NEON intrinsics (I'm porting my DSP code to Android):
    #ifdef __ARM_NEON
    #include <arm_neon.h>
    #endif

    // Complex multiply-add on split-complex vectors: D = A * B + C.
    // Note: the stride arguments are ignored; unit stride is assumed throughout.
    void vDSP_zvma(const DSPSplitComplex *__A, vDSP_Stride __IA, const DSPSplitComplex *__B,
                   vDSP_Stride __IB, const DSPSplitComplex *__C, vDSP_Stride __IC,
                   const DSPSplitComplex *__D, vDSP_Stride __ID, vDSP_Length __N) {
        vDSP_Length n = 0;
    #ifdef __ARM_NEON
        // Vector body: four complex elements per iteration.
        vDSP_Length postamble_start = __N & ~3;
        for (; n < postamble_start; n += 4) {
            float32x4_t Ar = vld1q_f32(__A->realp + n);
            float32x4_t Br = vld1q_f32(__B->realp + n);
            float32x4_t Cr = vld1q_f32(__C->realp + n);
            float32x4_t Ai = vld1q_f32(__A->imagp + n);
            float32x4_t Bi = vld1q_f32(__B->imagp + n);
            float32x4_t Ci = vld1q_f32(__C->imagp + n);
            // Real part: Cr + Ar*Br - Ai*Bi
            float32x4_t Dr = vmlaq_f32(Cr, Ar, Br);
            Dr = vmlsq_f32(Dr, Ai, Bi);
            vst1q_f32(__D->realp + n, Dr);
            // Imaginary part: Ci + Ar*Bi + Ai*Br
            float32x4_t Di = vmlaq_f32(Ci, Ar, Bi);
            Di = vmlaq_f32(Di, Ai, Br);
            vst1q_f32(__D->imagp + n, Di);
        }
    #endif
        // Scalar tail (or the whole vector when NEON is unavailable).
        for (; n < __N; n++) {
            __D->realp[n] =
                __C->realp[n] + __A->realp[n] * __B->realp[n] - __A->imagp[n] * __B->imagp[n];
            __D->imagp[n] =
                __C->imagp[n] + __A->realp[n] * __B->imagp[n] + __A->imagp[n] * __B->realp[n];
        }
    }
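For reference, this is the per-element operation that both loops are meant to compute, i.e. the complex multiply-add D = A·B + C written out in split real/imaginary form (the subscripts r and i are just my shorthand for realp and imagp):

```latex
\begin{aligned}
D_r[n] &= C_r[n] + A_r[n]\,B_r[n] - A_i[n]\,B_i[n] \\
D_i[n] &= C_i[n] + A_r[n]\,B_i[n] + A_i[n]\,B_r[n]
\end{aligned}
\qquad 0 \le n < N
```

Counting from the code above, each group of four elements takes six vector loads and two vector stores for four multiply-accumulate instructions.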
However, in my tests the speedup is relatively poor: the NEON version is only about 3x faster than the build without NEON. What might be the reason, and what can be done to fix this?
Update: just to clarify, this code does run much faster than the naive loop in C (about 3x); however, in the other functions that I ported, the performance gain was closer to 4x (as expected).
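In case it helps to reproduce the comparison, here is a minimal sketch of the kind of harness that can be used to check the two paths against each other and time them. It is not my exact test code: `SplitComplex`, `zvma_scalar`, `zvma_vector`, `make_operands`, `zvma_max_abs_diff` and `zvma_time_ms` are hypothetical names made up for this sketch, unit stride is assumed, and timing uses the standard `clock()`, which is coarse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#ifdef __ARM_NEON
#include <arm_neon.h>
#endif

/* Hypothetical stand-in for DSPSplitComplex, so the sketch compiles on its own. */
typedef struct { float *realp; float *imagp; } SplitComplex;
typedef void (*zvma_fn)(const SplitComplex *, const SplitComplex *,
                        const SplitComplex *, const SplitComplex *, size_t);

/* Scalar D = A * B + C over elements [begin, end), unit stride. */
static void zvma_range(const SplitComplex *a, const SplitComplex *b, const SplitComplex *c,
                       const SplitComplex *d, size_t begin, size_t end) {
    for (size_t i = begin; i < end; i++) {
        d->realp[i] = c->realp[i] + a->realp[i] * b->realp[i] - a->imagp[i] * b->imagp[i];
        d->imagp[i] = c->imagp[i] + a->realp[i] * b->imagp[i] + a->imagp[i] * b->realp[i];
    }
}

/* Scalar reference over the whole vector. */
static void zvma_scalar(const SplitComplex *a, const SplitComplex *b,
                        const SplitComplex *c, const SplitComplex *d, size_t n) {
    zvma_range(a, b, c, d, 0, n);
}

/* Version under test: NEON body (as in the question) when available, scalar tail. */
static void zvma_vector(const SplitComplex *a, const SplitComplex *b,
                        const SplitComplex *c, const SplitComplex *d, size_t n) {
    size_t i = 0;
#ifdef __ARM_NEON
    for (; i + 4 <= n; i += 4) {
        float32x4_t ar = vld1q_f32(a->realp + i), ai = vld1q_f32(a->imagp + i);
        float32x4_t br = vld1q_f32(b->realp + i), bi = vld1q_f32(b->imagp + i);
        float32x4_t dr = vmlaq_f32(vld1q_f32(c->realp + i), ar, br);
        float32x4_t di = vmlaq_f32(vld1q_f32(c->imagp + i), ar, bi);
        vst1q_f32(d->realp + i, vmlsq_f32(dr, ai, bi));
        vst1q_f32(d->imagp + i, vmlaq_f32(di, ai, br));
    }
#endif
    zvma_range(a, b, c, d, i, n);
}

/* Allocates 8*n floats laid out as A.re A.im B.re B.im C.re C.im D.re D.im
   and fills the six input arrays with deterministic values in [-1, 1]. */
static float *make_operands(size_t n, SplitComplex v[4]) {
    float *buf = malloc(8 * n * sizeof *buf);
    assert(n > 0 && buf != NULL);
    for (size_t i = 0; i < 6 * n; i++)
        buf[i] = (float)((i * 37u) % 201u) / 100.0f - 1.0f;
    for (size_t k = 0; k < 4; k++) {
        v[k].realp = buf + (2 * k) * n;
        v[k].imagp = buf + (2 * k + 1) * n;
    }
    return buf;
}

/* Largest absolute difference between the reference and the version under test. */
static float zvma_max_abs_diff(size_t n) {
    SplitComplex r[4], t[4];
    float *rb = make_operands(n, r), *tb = make_operands(n, t);
    zvma_scalar(&r[0], &r[1], &r[2], &r[3], n);
    zvma_vector(&t[0], &t[1], &t[2], &t[3], n);
    float worst = 0.0f;
    for (size_t i = 0; i < 2 * n; i++) { /* D.re and D.im are contiguous in the buffer */
        float diff = r[3].realp[i] - t[3].realp[i];
        if (diff < 0.0f) diff = -diff;
        if (diff > worst) worst = diff;
    }
    free(rb);
    free(tb);
    return worst;
}

/* Average milliseconds per call over `reps` calls (CPU time via clock()). */
static double zvma_time_ms(zvma_fn fn, size_t n, int reps) {
    SplitComplex v[4];
    float *buf = make_operands(n, v);
    clock_t start = clock();
    for (int r = 0; r < reps; r++) fn(&v[0], &v[1], &v[2], &v[3], n);
    clock_t stop = clock();
    volatile float sink = v[3].realp[0]; /* read a result so the stores are not dead */
    (void)sink;
    free(buf);
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

The speedup would then be something like `zvma_time_ms(zvma_scalar, n, reps) / zvma_time_ms(zvma_vector, n, reps)`. Calling `zvma_max_abs_diff` with a length that is not a multiple of 4 (say 1003) also exercises the scalar tail.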