Hi, I'm developing an image processing application on the NXP i.MX7 and I want to compare the performance of NEON instructions vs. pure C.

C: a, b and c are float32 arrays. Takes 11 ms to run.

for(int pixIndex = 0; pixIndex<(640*480); pixIndex++)
{
    a[pixIndex] = (a[pixIndex] * b[pixIndex]) + c[pixIndex];
}

NEON: Takes 10 ms to run.

for(int pixIndex = 0; pixIndex < (640*480)/2; pixIndex++)
{
    float32x2_t dVect1, dVect2,dVect3;

    dVect1 = vld1_f32(a);
    dVect2 = vld1_f32(b);
    dVect3 = vld1_f32(c);
    dVect1 = vmla_f32(dVect3, dVect1, dVect2);
    vst1_f32(a, dVect1);
    a += 2;
    b += 2;
    c += 2;
}

Why is NEON only 1 ms faster than C? Am I missing something here?

Steve Friedl
  • Double check your assembly output to see if it is doing what you expect. Better yet, see if you can convince your compiler to do the work for you by giving hints to the vectorizer. Ah, they are floats. Ok. – Michael Dorgan Dec 11 '19 at 00:55
  • Any memory caching on the system? You are spending most of your time basically loading and saving memory. If you have no cache on this memory, that would explain the lack of perf. Ok, specs say L1/L2. – Michael Dorgan Dec 11 '19 at 00:58
  • Basically then, all the NEON can give you is 2 multiplies at a time since the loads and stores are going to be bus bound. I am guessing this works out to the small gain you are seeing in perf. – Michael Dorgan Dec 11 '19 at 01:02
  • Most probably your C version gets auto-vectorized. – Jake 'Alquimista' LEE Dec 11 '19 at 08:12
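
To put rough numbers on the comments above, here is a minimal sketch, not a measured result. It assumes the loop is limited by memory traffic, as suggested, and counts three loads plus one store per element. The function names (`fma_scalar`, `fma_neon_q`, `fma_bytes_moved`, `fma_bandwidth_mbps`) are made up for illustration. The NEON variant uses 128-bit q registers instead of the 64-bit d registers in the question, and falls back to the plain loop when `__ARM_NEON` is not defined.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Scalar kernel. `restrict` promises the compiler that a, b and c do not
 * overlap, which may help its vectorizer (an assumption about your data). */
void fma_scalar(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * b[i] + c[i];
}

/* Same kernel with 128-bit q registers: 4 floats per iteration. */
void fma_neon_q(float *restrict a, const float *restrict b,
                const float *restrict c, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        vst1q_f32(a + i, vmlaq_f32(vc, va, vb)); /* vc + va * vb */
    }
#endif
    for (; i < n; i++) /* tail elements, or everything without NEON */
        a[i] = a[i] * b[i] + c[i];
}

/* Bytes moved per call: loads of a, b, c and one store to a per element. */
size_t fma_bytes_moved(size_t n)
{
    return n * 4 * sizeof(float);
}

/* Effective bandwidth in MB/s (1 MB = 1e6 bytes) for a measured run time. */
double fma_bandwidth_mbps(size_t n, double seconds)
{
    return (double)fma_bytes_moved(n) / seconds / 1e6;
}
```

With the timings from the question, `fma_bandwidth_mbps(640 * 480, 0.010)` comes out at about 491.5 MB/s and `fma_bandwidth_mbps(640 * 480, 0.011)` at about 446.8 MB/s. If both versions sit near the same figure, that would be consistent with the loop being bus bound rather than compute bound. Whether the q-register version helps at all on the i.MX7 is something only a measurement will tell.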
