I'm trying to implement Gauss-Newton optimization for a specific problem on iPhone ARM using NEON. The first function below is my original C function. The second is the NEON asm code I wrote. I ran each one 100,000 times, and the NEON version takes 7-8 times longer than the C version. I think the loading (vld1.32) is what takes most of the time; I came to this conclusion by experimenting with removing some of the instructions.

Does anyone have any insight into this problem? Thanks!

template<class T>
inline void GaussNewtonOperationJtr8x8(T Jtr[8], const T J[8], T residual)
{
    Jtr[0] -= J[0]*residual;
    Jtr[1] -= J[1]*residual;
    Jtr[2] -= J[2]*residual;
    Jtr[3] -= J[3]*residual;
    Jtr[4] -= J[4]*residual;
    Jtr[5] -= J[5]*residual;
    Jtr[6] -= J[6]*residual;
    Jtr[7] -= J[7]*residual;    
}

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
                      // load Jtr into registers
                      "vld1.32   {d0-d3}, [%0]\n\t"
                      // load J into registers
                      "vld1.32   {d4-d7}, [%1]\n\t"
                      // load residual in register
                      "vmov.f32  s16, %2\n\t"
                      // Jtr -= J*residual
                      "vmls.f32  q0, q2, d8[0]\n\t"
                      "vmls.f32  q1, q3, d8[0]\n\t"
                      // store result
                      "vst1.32   {d0-d3}, [%0]\n\t"
                      // output
                      :
                      // input
                      : "r"(Jtr), "r"(J), "r"(residual)
                      // registers
                      : "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10", "d11", "d12", "d13", "d14"
                      );
}
paul

4 Answers

  1. Don't use d8-d15. They are callee-saved, so they have to be saved onto the stack before use and restored afterwards. The compiler inserts the instructions that do this, wasting valuable cycles.
  2. Load J before Jtr. Jtr is needed at a later pipeline stage than J.
  3. Use VLDMIA/VSTMIA instead of VLD/VST. VLDMIA/VSTMIA are faster and have an advantage pipeline-wise.
  4. Use vector-vector multiplication instead of vector-scalar multiplication.
  5. If you create a looped version, put pld at the beginning and unroll the loop so that 64 bytes are read from each pointer per iteration.

Besides the faults I mentioned above, which are typical for people new to NEON, your approach is very nice. You found the most appropriate instruction in vmls.

Well done.

inline void GaussNewtonOperationJtr8x8_NEON(NFloat Jtr[8], const NFloat J[8], NFloat residual)
{
    __asm__ volatile (
        // broadcast residual into all four lanes of q12
        "vdup.32   q12, %2\n\t"
        // load J into registers
        "vldmia    %1, {q10-q11}\n\t"
        // load Jtr into registers
        "vldmia    %0, {q8-q9}\n\t"
        // Jtr -= J*residual
        "vmls.f32  q8, q10, q12\n\t"
        "vmls.f32  q9, q11, q12\n\t"
        // store result
        "vstmia    %0, {q8-q9}\n\t"
        // output
        :
        // input
        : "r"(Jtr), "r"(J), "r"(residual)
        // clobbered registers; "memory" because the asm writes through Jtr
        : "q8", "q9", "q10", "q11", "q12", "memory"
    );
}
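For comparison, the same load / multiply-subtract / store sequence can also be written with NEON intrinsics, which leaves register allocation and scheduling to the compiler. This is only a sketch: the function name is hypothetical, it assumes NFloat is float, and the scalar branch exists only so that it builds on targets without NEON. Whether it beats the inline asm would have to be measured.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical name. Computes Jtr[i] -= J[i] * residual for i = 0..7.
// Assumption: NFloat in the question is a plain float.
inline void GaussNewtonOperationJtr8x8_Intrinsics(float Jtr[8], const float J[8], float residual)
{
    assert(Jtr != nullptr && J != nullptr);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Broadcast the residual to all four lanes (vector-vector multiply).
    const float32x4_t r = vdupq_n_f32(residual);
    // Load J first, then Jtr.
    const float32x4_t j0 = vld1q_f32(J);
    const float32x4_t j1 = vld1q_f32(J + 4);
    float32x4_t t0 = vld1q_f32(Jtr);
    float32x4_t t1 = vld1q_f32(Jtr + 4);
    // t = t - j * r
    t0 = vmlsq_f32(t0, j0, r);
    t1 = vmlsq_f32(t1, j1, r);
    vst1q_f32(Jtr, t0);
    vst1q_f32(Jtr + 4, t1);
#else
    // Scalar fallback for non-NEON builds.
    for (std::size_t i = 0; i < 8; ++i) {
        Jtr[i] -= J[i] * residual;
    }
#endif
}
```

Because no registers are named explicitly, the compiler is free to stay away from d8-d15 on its own.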
Jake 'Alquimista' LEE

The compiler itself optimizes the assembly it generates from the C code. It doesn't just translate one language into another.

What you are trying to do is optimize better than the compiler (oh ow). Do you at least know what assembly code the compiler is generating for the C code above? Well, you should if you want your assembly code to be better.

EDIT:

This thread has a great discussion about this sort of stuff: Why ARM NEON not faster than plain C++?

karlphillip
  • I have not seen that the GCC compiler generates NEON code. So I'm experimenting by generating the ASM NEON code myself and comparing to C code. – paul May 17 '11 at 17:59
  • I read through this link more carefully. So I guess my example would not perform well using NEON? I moved the instructions around to remove dependency, but I didn't have any improvement. – paul May 17 '11 at 18:49
  • What's the time difference (in milliseconds) for one single execution (not 100,000) between your C code and the assembly you came up with? – karlphillip May 17 '11 at 18:52
  • I haven't found a high enough resolution timer that works on the iPhone to test just one iteration yet. – paul May 17 '11 at 18:54
  • I did: http://stackoverflow.com/q/3540234/176769 and there are 2 approaches for doing it. – karlphillip May 17 '11 at 19:04
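When no timer is fine enough to time one call, a common workaround is to time many calls and divide. The helper below is a minimal sketch of that idea; its name is hypothetical, and it assumes a C++11 toolchain with std::chrono, which is not necessarily what the linked question uses.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: runs `call` `iterations` times and returns the
// average wall-clock time per call in nanoseconds.
template <class F>
double AverageNanosecondsPerCall(F&& call, long iterations)
{
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        call();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

The callable should write to something the program later reads (for example the Jtr array), otherwise the optimizer may remove the work being timed.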

You're switching between NEON and VFP instructions. There's a penalty for doing so on both the Cortex-A8 and A9. Get rid of that VFP vmov.f32 instruction and also make sure that this code isn't inlined into places that use VFP code unless there's a long run of such code to justify the pipeline context switch.

ohmantics
  • Thanks. Is there another way to get a single precision number into a NEON register? I need to get the "residual" parameter into a register. – paul May 18 '11 at 19:27
  • Make it be the first of an array of two floats and load it into a D register instead. Generally speaking, double and quad float operations are NEON, single float operations are VFP. – ohmantics May 20 '11 at 03:03

Is your C++ version actually using floats? I can't tell because you only gave the template and didn't show which instantiation you used. It's very strange that NEON would be drastically slower than VFP on Cortex-A8 for this code, but for u32s I could see it possibly working out that way.

I don't know what the ABI is, but there could be some overhead in how the residual is passed (that is, what the compiler is doing to get it into that %2 register). Try passing a pointer instead and use the single-element form of vld1; you can load just one float in NEON this way.

You'll get better performance out of the arrays if you use 16-byte aligned loads and stores, but you may have to play some games to get the inputs to work this way. Unfortunately, you'll never get really great performance out of this because you're not avoiding most of the latency of the vmls instruction, which is lengthy (due to chaining the NEON multiply and add pipelines end to end). It's worse due to the dependent instruction being a store, which needs its input early in the NEON pipeline. Ideally you'll be able to do several of these operations at a time, and can interleave multiple instances together, as many as you can fit into registers.
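As a rough illustration of doing several of these operations at a time, the sketch below accumulates a whole batch of rows into Jtr while keeping the accumulators in registers, and processes two rows per iteration so that four independent vmls chains are in flight. The function name, the row-major layout of J (8 floats per row) and the residuals array are assumptions for the example, not something from the question. Each residual is loaded through a pointer with a single-element load that fills all lanes. Because the even and odd rows are summed separately, rounding can differ slightly from a strictly sequential loop.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical batched form: Jtr[k] -= sum over i of J[i*8 + k] * residuals[i].
// Assumed layout: J holds `count` rows of 8 floats, residuals holds `count` floats.
inline void GaussNewtonAccumulateJtr8(float Jtr[8], const float* J,
                                      const float* residuals, std::size_t count)
{
    assert(Jtr != nullptr);
    assert(count == 0 || (J != nullptr && residuals != nullptr));
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // a0/a1 accumulate even rows, b0/b1 odd rows: four independent chains.
    float32x4_t a0 = vld1q_f32(Jtr);
    float32x4_t a1 = vld1q_f32(Jtr + 4);
    float32x4_t b0 = vdupq_n_f32(0.0f);
    float32x4_t b1 = vdupq_n_f32(0.0f);
    std::size_t i = 0;
    for (; i + 1 < count; i += 2) {
        // Single-element loads that replicate the float into all lanes.
        const float32x4_t r0 = vld1q_dup_f32(residuals + i);
        const float32x4_t r1 = vld1q_dup_f32(residuals + i + 1);
        a0 = vmlsq_f32(a0, vld1q_f32(J + 8 * i), r0);
        a1 = vmlsq_f32(a1, vld1q_f32(J + 8 * i + 4), r0);
        b0 = vmlsq_f32(b0, vld1q_f32(J + 8 * i + 8), r1);
        b1 = vmlsq_f32(b1, vld1q_f32(J + 8 * i + 12), r1);
    }
    if (i < count) {  // leftover row when count is odd
        const float32x4_t r0 = vld1q_dup_f32(residuals + i);
        a0 = vmlsq_f32(a0, vld1q_f32(J + 8 * i), r0);
        a1 = vmlsq_f32(a1, vld1q_f32(J + 8 * i + 4), r0);
    }
    // Combine the two partial sums and store once, outside the loop.
    vst1q_f32(Jtr, vaddq_f32(a0, b0));
    vst1q_f32(Jtr + 4, vaddq_f32(a1, b1));
#else
    // Scalar fallback for non-NEON builds.
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t k = 0; k < 8; ++k) {
            Jtr[k] -= J[i * 8 + k] * residuals[i];
        }
    }
#endif
}
```

The store now happens once per batch rather than once per row, so it no longer sits directly behind every vmls.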

Exophase