NEON ASM code running much slower than C code?

Question

I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times and the NEON version takes 7-8 times longer than the C version. I think the loading (vld1.32) is what takes most of the time. I experimented by removing some instructions.

Does anyone have any insight into this problem? Thanks!

template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;    
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
                      // load Jtr into registers
                      "vld1.32   {d0-d3}, [%0]\n\t"
                      // load J into registers
                      "vld1.32   {d4-d7}, [%1]\n\t"
                      // load residual in register
                      "vmov.f32  s16, %2\n\t"
                      // Jtr -= J*residual
                      "vmls.f32  q0, q2, d8[0]\n\t"
                      "vmls.f32  q1, q3, d8[0]\n\t"
                      // store result
                      "vst1.32   {d0-d3}, [%0]\n\t"
                      // output
                      :
                      // input
                      : "r"(Jtr), "r"(J), "r"(residual)
                      // registers
                      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
                      );
}

Jake 'Alquimista' LEE · Accepted Answer · 2011-11-05T07:20:22.437

Don't use d8-d15. They are callee-saved, so they have to be pushed onto the stack before use and restored afterwards. The compiler inserts the instructions to do this, wasting valuable cycles.
Load J before Jtr. Jtr is needed at a later pipeline stage than J.
Use VLDMIA/VSTMIA instead of VLD/VST. VLDMIA/VSTMIA is faster and has an advantage pipeline-wise.
Use vector-vector multiplication instead of vector-scalar multiplication.
If you create a looped version, put pld at the beginning and unroll the loop so that 64 bytes are read from each pointer per iteration.

Besides the faults I mentioned above, which are typical for people new to NEON, your approach is very nice. You found the most appropriate instruction in vmls.

Well done.

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
        // broadcast residual to all four lanes of q12
        "vdup.32   q12, %2\n\t"
        // load J into registers
        "vldmia    %1, {q10-q11}\n\t"
        // load Jtr into registers
        "vldmia    %0, {q8-q9}\n\t"
        // Jtr -= J*residual
        "vmls.f32  q8, q10, q12\n\t"
        "vmls.f32  q9, q11, q12\n\t"
        // store result
        "vstmia    %0, {q8-q9}\n\t"
        // output
        :
        // input
        : "r"(Jtr), "r"(J), "r"(residual)
        // clobbers: the NEON registers used, plus memory because Jtr is written through %0
        : "q8", "q9", "q10", "q11", "q12", "memory"
    );
}
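
The looped version mentioned in the last point could look roughly like the intrinsics sketch below. It is an illustration, not the code from this answer: the name jtr_update_block and the requirement that n be a multiple of 16 floats (64 bytes from each pointer per iteration) are assumptions made for the example, __builtin_prefetch stands in for pld on GCC/Clang, and a plain C fallback is used when NEON is not available.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical helper: Jtr[i] -= J[i] * residual for i in [0, n).
 * n must be a multiple of 16 so each iteration reads 64 bytes per pointer. */
static void jtr_update_block(float *Jtr, const float *J, float residual, size_t n)
{
    assert(n % 16 == 0);
#if HAVE_NEON
    /* Broadcast the scalar once so the loop uses vector-vector vmls. */
    const float32x4_t r = vdupq_n_f32(residual);
#endif
    for (size_t i = 0; i < n; i += 16) {
#if defined(__GNUC__)
        /* Hint the next 64-byte chunk of each array (at most one past the end). */
        __builtin_prefetch(J + i + 16);
        __builtin_prefetch(Jtr + i + 16);
#endif
#if HAVE_NEON
        /* Source order loads J before Jtr; the compiler may still reschedule. */
        float32x4_t j0 = vld1q_f32(J + i);
        float32x4_t j1 = vld1q_f32(J + i + 4);
        float32x4_t j2 = vld1q_f32(J + i + 8);
        float32x4_t j3 = vld1q_f32(J + i + 12);
        float32x4_t a0 = vld1q_f32(Jtr + i);
        float32x4_t a1 = vld1q_f32(Jtr + i + 4);
        float32x4_t a2 = vld1q_f32(Jtr + i + 8);
        float32x4_t a3 = vld1q_f32(Jtr + i + 12);
        /* Four independent a -= j * r operations. */
        a0 = vmlsq_f32(a0, j0, r);
        a1 = vmlsq_f32(a1, j1, r);
        a2 = vmlsq_f32(a2, j2, r);
        a3 = vmlsq_f32(a3, j3, r);
        vst1q_f32(Jtr + i, a0);
        vst1q_f32(Jtr + i + 4, a1);
        vst1q_f32(Jtr + i + 8, a2);
        vst1q_f32(Jtr + i + 12, a3);
#else
        /* Portable fallback with the same result. */
        for (size_t k = 0; k < 16; ++k) {
            Jtr[i + k] -= J[i + k] * residual;
        }
#endif
    }
}
```

With intrinsics the compiler picks the registers and the final instruction order, so it is still worth checking the generated assembly against the hand-written version.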

score 3 · Answer 2 · edited May 23 '17 at 12:22

The compiler itself optimizes the assembly generated from the C code. It doesn't just translate one code into another.

What you are trying to do is optimize better than the compiler (uh-oh). Do you at least know what assembly code the compiler generates for the C code above? You should, if you want your assembly code to be better.

EDIT:

This thread has a great discussion about this sort of stuff: Why ARM NEON not faster than plain C++?

answered May 17 '11 at 17:57 · karlphillip · edited May 23 '17 at 12:22 by Community

I have not seen the GCC compiler generate NEON code. So I'm experimenting by writing the NEON asm myself and comparing it to the C code. – paul May 17 '11 at 17:59
I read through this link more carefully. So I guess my example would not perform well using NEON? I moved the instructions around to remove dependency, but I didn't have any improvement. – paul May 17 '11 at 18:49
What's the time difference (in milliseconds) for one single execution (not 100,000) between your C code and the assembly you came up with? – karlphillip May 17 '11 at 18:52
I haven't found a high enough resolution timer that works on the iPhone to test just one iteration yet. – paul May 17 '11 at 18:54
I did: http://stackoverflow.com/q/3540234/176769 and there are 2 approaches for doing it. – karlphillip May 17 '11 at 19:04

score 3 · Answer 3 · answered May 18 '11 at 10:06

You're switching between NEON and VFP instructions. There's a penalty for doing so on both the Cortex-A8 and A9. Get rid of that VFP vmov.f32 instruction and also make sure that this code isn't inlined into places that use VFP code unless there's a long run of such code to justify the pipeline context switch.

answered May 18 '11 at 10:06 · ohmantics

Thanks. Is there another way to get a single precision number into a NEON register? I need to get the "residual" parameter into a register. – paul May 18 '11 at 19:27
Make it be the first of an array of two floats and load it into a D register instead. Generally speaking, double and quad float operations are NEON, single float operations are VFP. – ohmantics May 20 '11 at 03:03
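
One way to follow this suggestion is to pass the residual by pointer and let a NEON load broadcast it, so no VFP vmov is needed. The sketch below uses intrinsics rather than inline asm. The function name jtr8_update and the pointer parameter are assumptions for the example, not part of the original code, and the plain C branch is only there so the snippet builds without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical variant of the 8-element update: Jtr[k] -= J[k] * (*residual).
 * The residual is read from memory instead of being passed by value. */
static void jtr8_update(float Jtr[8], const float J[8], const float *residual)
{
    assert(residual != NULL);
#if HAVE_NEON
    /* Load one float and duplicate it into all four lanes (a vld1 "dup" load). */
    float32x4_t r  = vld1q_dup_f32(residual);
    float32x4_t j0 = vld1q_f32(J);
    float32x4_t j1 = vld1q_f32(J + 4);
    float32x4_t a0 = vld1q_f32(Jtr);
    float32x4_t a1 = vld1q_f32(Jtr + 4);
    a0 = vmlsq_f32(a0, j0, r);   /* a0 -= j0 * r */
    a1 = vmlsq_f32(a1, j1, r);   /* a1 -= j1 * r */
    vst1q_f32(Jtr, a0);
    vst1q_f32(Jtr + 4, a1);
#else
    /* Portable fallback with the same result. */
    for (size_t k = 0; k < 8; ++k) {
        Jtr[k] -= J[k] * (*residual);
    }
#endif
}
```

A caller would write something like jtr8_update(Jtr, J, &residual). Whether this avoids the NEON/VFP switch in practice depends on how the surrounding code produces the residual.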

score 1 · Answer 4 · answered May 30 '11 at 17:57

Is your C++ version actually using floats? I can't tell because you only gave the template and didn't show which instantiation you used. It's very strange that NEON would be drastically slower than VFP on Cortex-A8 for this code, but for u32s I could see it possibly working out that way.

I don't know what the ABI is, but there could be some overhead for how the residual is passed (that is, what the compiler is doing to get it into that %2 register). Try using a pointer instead and use vld1 on single-element - you can load just one float in NEON this way.

You'll get better performance out of the arrays if you use 16-byte aligned loads and stores, but you may have to play some games to get the inputs to work this way. Unfortunately, you'll never get really great performance out of this because you're not avoiding most of the latency of the vmls instruction, which is lengthy (due to chaining the NEON multiply and add pipelines end to end). It's worse because the dependent instruction is a store, which needs its input early in the NEON pipeline. Ideally you'll be able to do several of these operations at a time, and can interleave multiple instances together - as many as you can fit into registers.
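
As a rough illustration of doing several of these operations at a time, the sketch below assumes the caller has many (J, residual) pairs to fold into the same Jtr, which the question does not actually state. It keeps Jtr in registers across the whole batch and uses separate accumulators for even and odd rows, so consecutive vmls instructions do not depend on each other and the store happens only once at the end. The name jtr8_accumulate, the row-major count x 8 layout of J, and the even count requirement are all made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical batch update: Jtr[k] -= sum over i of J[8*i + k] * residual[i].
 * J holds count rows of 8 floats; count must be even in this sketch. */
static void jtr8_accumulate(float Jtr[8], const float *J,
                            const float *residual, size_t count)
{
    assert(count % 2 == 0);
#if HAVE_NEON
    /* Even rows accumulate into a0/a1, odd rows into b0/b1. */
    float32x4_t a0 = vld1q_f32(Jtr);
    float32x4_t a1 = vld1q_f32(Jtr + 4);
    float32x4_t b0 = vdupq_n_f32(0.0f);
    float32x4_t b1 = vdupq_n_f32(0.0f);
    for (size_t i = 0; i < count; i += 2) {
        const float *row0 = J + 8 * i;
        const float *row1 = row0 + 8;
        float32x4_t r0 = vld1q_dup_f32(residual + i);
        float32x4_t r1 = vld1q_dup_f32(residual + i + 1);
        /* Four independent multiply-subtract chains per iteration. */
        a0 = vmlsq_f32(a0, vld1q_f32(row0), r0);
        a1 = vmlsq_f32(a1, vld1q_f32(row0 + 4), r0);
        b0 = vmlsq_f32(b0, vld1q_f32(row1), r1);
        b1 = vmlsq_f32(b1, vld1q_f32(row1 + 4), r1);
    }
    /* Combine the two partial results and store once. */
    vst1q_f32(Jtr, vaddq_f32(a0, b0));
    vst1q_f32(Jtr + 4, vaddq_f32(a1, b1));
#else
    /* Portable fallback; the summation order differs from the NEON path. */
    for (size_t i = 0; i < count; ++i) {
        for (size_t k = 0; k < 8; ++k) {
            Jtr[k] -= J[8 * i + k] * residual[i];
        }
    }
#endif
}
```

Because the two paths add terms in a different order, results can differ in the last bits for general inputs.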

Yes, the C++ version is using floats. – paul Jun 04 '11 at 14:22
