
I'm trying to build an optimized right-hand matrix multiplication using ARM NEON. This is my current attempt:

void transform ( glm::mat4 const & matrix, glm::vec4 const & input, glm::vec4 & output )
{
   float32x4_t &       result_local = reinterpret_cast < float32x4_t & > (*(&output[0]));
   float32x4_t const & input_local  = reinterpret_cast < float32x4_t const & > (*(&input[0] ));

   result_local = vmulq_f32 (               reinterpret_cast < float32x4_t const & > ( matrix[ 0 ] ), input_local );
   result_local = vmlaq_f32 ( result_local, reinterpret_cast < float32x4_t const & > ( matrix[ 1 ] ), input_local );
   result_local = vmlaq_f32 ( result_local, reinterpret_cast < float32x4_t const & > ( matrix[ 2 ] ), input_local );
   result_local = vmlaq_f32 ( result_local, reinterpret_cast < float32x4_t const & > ( matrix[ 3 ] ), input_local );
}

The compiler (GCC) does produce NEON instructions; however, the input parameter (which is passed in x1) is reloaded into q1 before every fmla instruction:

0x0000000000400a78 <+0>:    ldr q1, [x1]
0x0000000000400a7c <+4>:    ldr q0, [x0]
0x0000000000400a80 <+8>:    fmul    v0.4s, v0.4s, v1.4s
0x0000000000400a84 <+12>:   str q0, [x2]
0x0000000000400a88 <+16>:   ldr q2, [x0,#16]
0x0000000000400a8c <+20>:   ldr q1, [x1]
0x0000000000400a90 <+24>:   fmla    v0.4s, v2.4s, v1.4s
0x0000000000400a94 <+28>:   str q0, [x2]
0x0000000000400a98 <+32>:   ldr q2, [x0,#32]
0x0000000000400a9c <+36>:   ldr q1, [x1]
0x0000000000400aa0 <+40>:   fmla    v0.4s, v2.4s, v1.4s
0x0000000000400aa4 <+44>:   str q0, [x2]
0x0000000000400aa8 <+48>:   ldr q2, [x0,#48]
0x0000000000400aac <+52>:   ldr q1, [x1]
0x0000000000400ab0 <+56>:   fmla    v0.4s, v2.4s, v1.4s
0x0000000000400ab4 <+60>:   str q0, [x2]
0x0000000000400ab8 <+64>:   ret

Is it possible to avoid these reloads as well?

The compiler is gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu with the -O2 option.


Edit: Removing the reference from input_local did the trick.
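For reference, the change is a one-liner: make input_local a plain value instead of a reference, so the copy is private to the function and cannot alias output (a sketch using the same names as the code above):

   float32x4_t const input_local = reinterpret_cast < float32x4_t const & > ( input[ 0 ] ); // copied once into a register

The disassembly then becomes: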

0x0000000000400af0 <+0>:    ldr q1, [x1]
0x0000000000400af4 <+4>:    ldr q0, [x0]
0x0000000000400af8 <+8>:    fmul    v0.4s, v1.4s, v0.4s
0x0000000000400afc <+12>:   str q0, [x2]
0x0000000000400b00 <+16>:   ldr q2, [x0,#16]
0x0000000000400b04 <+20>:   fmla    v0.4s, v1.4s, v2.4s
0x0000000000400b08 <+24>:   str q0, [x2]
0x0000000000400b0c <+28>:   ldr q2, [x0,#32]
0x0000000000400b10 <+32>:   fmla    v0.4s, v1.4s, v2.4s
0x0000000000400b14 <+36>:   str q0, [x2]
0x0000000000400b18 <+40>:   ldr q2, [x0,#48]
0x0000000000400b1c <+44>:   fmla    v0.4s, v1.4s, v2.4s
0x0000000000400b20 <+48>:   str q0, [x2]
0x0000000000400b24 <+52>:   ret

Edit 2: That's the best I have obtained so far:

0x0000000000400ea0 <+0>:    ldr q1, [x1]
0x0000000000400ea4 <+4>:    ldr q0, [x0,#16]
0x0000000000400ea8 <+8>:    ldr q4, [x0]
0x0000000000400eac <+12>:   ldr q3, [x0,#32]
0x0000000000400eb0 <+16>:   fmul    v0.4s, v0.4s, v1.4s
0x0000000000400eb4 <+20>:   ldr q2, [x0,#48] 
0x0000000000400eb8 <+24>:   fmla    v0.4s, v4.4s, v1.4s
0x0000000000400ebc <+28>:   fmla    v0.4s, v3.4s, v1.4s
0x0000000000400ec0 <+32>:   fmla    v0.4s, v2.4s, v1.4s
0x0000000000400ec4 <+36>:   str q0, [x2]
0x0000000000400ec8 <+40>:   ret

According to perf, there still seems to be significant overhead in the ldr instructions.

Desperado17

2 Answers


You are operating directly on memory through references, which are effectively pointers. When you do that, you are completely at the compiler's mercy, and compilers for ARM aren't exactly the best.

There might be compiler options dealing with this, or even compilers doing the needed optimizations out of the box, but your best bet is doing it manually:

  • declare local vectors (without &)
  • load the values from the pointer into corresponding vectors (preferably the whole matrix plus the vector)
  • do the math with the vectors
  • store the vectors to the pointer

The process above is also valid for non-NEON computations. The compiler is almost always seriously hampered by the slightest hint of (automatic) memory access.

Remember, local variables are your best friends. And ALWAYS do the memory load/store manually.


Compiler: Android clang 8.0.2, -O2

#include <arm_neon.h>

void transform(const float *matrix, const float *input, float *output)
{
    const float32x4_t input_local = vld1q_f32(input);
    const float32x4_t row0 = vld1q_f32(&matrix[0*4]);
    const float32x4_t row1 = vld1q_f32(&matrix[1*4]);
    const float32x4_t row2 = vld1q_f32(&matrix[2*4]);
    const float32x4_t row3 = vld1q_f32(&matrix[3*4]);

    float32x4_t rslt;
    rslt = vmulq_f32(row0, input_local);
    rslt = vmlaq_f32(rslt, row1, input_local);
    rslt = vmlaq_f32(rslt, row2, input_local);
    rslt = vmlaq_f32(rslt, row3, input_local);

    vst1q_f32(output, rslt);
}

; void __fastcall transform(const float *matrix, const float *input, float *output)
EXPORT _Z9transformPKfS0_Pf
_Z9transformPKfS0_Pf
matrix = X0             ; const float *
input = X1              ; const float *
output = X2             ; float *
; __unwind {
LDR             Q0, [input]
LDP             Q1, Q2, [matrix]
LDP             Q3, Q4, [matrix,#0x20]
FMUL            V1.4S, V0.4S, V1.4S
FMUL            V2.4S, V0.4S, V2.4S
FMUL            V3.4S, V0.4S, V3.4S
FADD            V1.4S, V1.4S, V2.4S
FADD            V1.4S, V3.4S, V1.4S
FMUL            V0.4S, V0.4S, V4.4S
FADD            V0.4S, V0.4S, V1.4S
STR             Q0, [output]
RET
; } // starts at 4

As you can see, Android clang 8.0.2 is quite an improvement over the previous versions when it comes to NEON code. Finally the compiler generates code that loads multiple registers at once (LDP). Why it doesn't like FMLA is beyond me, though.
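One plausible explanation: vmlaq_f32 is defined as a non-fused multiply followed by an add, so the compiler cannot legally emit the fused FMLA for it without relaxed floating-point settings. If FMLA is what you want, a sketch using the explicitly fused vfmaq_f32 intrinsic (whether the serially dependent FMLA chain actually beats clang's mul/add tree above depends on the core's FMA latency; I have not benchmarked it):

#include <arm_neon.h>

void transform(const float *matrix, const float *input, float *output)
{
    const float32x4_t input_local = vld1q_f32(input);

    // vfmaq_f32(a, b, c) computes a + b * c as a single fused operation,
    // which the compiler may emit directly as FMLA.
    float32x4_t rslt = vmulq_f32(vld1q_f32(&matrix[0*4]), input_local);
    rslt = vfmaq_f32(rslt, vld1q_f32(&matrix[1*4]), input_local);
    rslt = vfmaq_f32(rslt, vld1q_f32(&matrix[2*4]), input_local);
    rslt = vfmaq_f32(rslt, vld1q_f32(&matrix[3*4]), input_local);

    vst1q_f32(output, rslt);
}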

Jake 'Alquimista' LEE
  • "declare local vectors (without &)" Removing the reference from input_local did the trick. – Desperado17 Jan 24 '19 at 10:22
  • Looking at your original prototype, you can see there's no way for the compiler to know for certain that `input` and `output` can never point to the same object. That's why it has to be conservative and assume every store to `output` _might_ have mutated `input`. C has the [`restrict`](https://en.cppreference.com/w/c/language/restrict) keyword for exactly this case. – Useless Jan 24 '19 at 10:26
  • Ok, the last version is almost there. The only difference is that it has 5 loads instead of 3. The reason is that it tries to load the matrix in 16 byte chunks instead of 32. – Desperado17 Jan 24 '19 at 11:03
  • Any more suggestions? According to perf, it still spends a lot of time in ldr instructions. Should I use __builtin_prefetch? – Desperado17 Jan 24 '19 at 17:31
  • @Desperado17 If you can, align the matrix pointer to 64 bytes and the vector pointer to 16 bytes. Otherwise, I need more information about this function itself: how many times is it called? Are the matrices and vectors in contiguous memory? Do they both change every iteration? etc. – Jake 'Alquimista' LEE Jan 25 '19 at 03:04
  • The matrix pointer is only loaded once per call. The asm code shows that the four columns are distributed over 4 q-registers. The entries of input are 16-byte aligned (I printed out the iterator addresses). My test calls the function once for each of 10000 randomly generated input entries. The compiler does not seem to emit a prefetch instruction. Adding a __builtin_prefetch in the loop body adds a prefetch instruction, but the number of cache misses according to cachegrind remains the same. – Desperado17 Jan 25 '19 at 10:02
  • 1
    @Desperado17 So, I take the function call occurs once per iteration (10000 times), both the matrix and vector change each time, and they are in a contiguous memory each, right? You should implement the loop inside the function itself iterating as many times as given by a parameter. And you should unroll the loop to do two/four multiplications inside the loop, plus an extra residual dealing part after the loop for the case of the number of given iteration being odd/not a multiple of four. This pretty much will eliminate most cache-miss/instruction latencies. – Jake 'Alquimista' LEE Jan 25 '19 at 10:28
  • Just as an FYI, the issue in the original case is that the input and output references are allowed to alias each other. The compiler has to assume that writing though the output reference modifies the input reference, hence the reload. The `restrict` keyword should be used as a form of API contract promise that parameters won't alias. – solidpixel Jan 25 '19 at 12:26
  • @solidpixel ahemmmmm, the `restrict` qualifier technically isn't part of standard `C++`, and doing it the old-school way, taking full control, is the most fool-proof approach in my experience. I'm tired of looking at disassembly to check whether the compiler did its job properly. – Jake 'Alquimista' LEE Jan 25 '19 at 12:33
  • Thanks, I didn't realize that (I'm mostly a C programmer). In my experience with SSE and NEON, you either write assembler or you spend your life checking that the compiler did its job properly... – solidpixel Jan 25 '19 at 14:37
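Following up on the comment thread, a minimal sketch of the batched version suggested above (the transform_batch name and signature are hypothetical; it assumes the input and output vectors sit in contiguous memory and that count need not be a multiple of two):

#include <arm_neon.h>

// Hypothetical batched variant, as suggested in the comments: one call
// transforms `count` vectors, with the loop unrolled by two and a
// residual pass for an odd count.
void transform_batch(const float *matrix, const float *input,
                     float *output, int count)
{
    // Load the matrix once; it stays in registers for the whole batch.
    const float32x4_t row0 = vld1q_f32(&matrix[0*4]);
    const float32x4_t row1 = vld1q_f32(&matrix[1*4]);
    const float32x4_t row2 = vld1q_f32(&matrix[2*4]);
    const float32x4_t row3 = vld1q_f32(&matrix[3*4]);

    int i = 0;
    for (; i + 2 <= count; i += 2) {
        __builtin_prefetch(&input[4*i + 32]);  // optional, as discussed above

        float32x4_t in0 = vld1q_f32(&input[4*i]);
        float32x4_t in1 = vld1q_f32(&input[4*i + 4]);

        // Two independent accumulator chains help hide the fmla latency.
        float32x4_t r0 = vmulq_f32(row0, in0);
        float32x4_t r1 = vmulq_f32(row0, in1);
        r0 = vmlaq_f32(r0, row1, in0);
        r1 = vmlaq_f32(r1, row1, in1);
        r0 = vmlaq_f32(r0, row2, in0);
        r1 = vmlaq_f32(r1, row2, in1);
        r0 = vmlaq_f32(r0, row3, in0);
        r1 = vmlaq_f32(r1, row3, in1);

        vst1q_f32(&output[4*i], r0);
        vst1q_f32(&output[4*i + 4], r1);
    }
    for (; i < count; ++i) {                   // residual vector
        float32x4_t in = vld1q_f32(&input[4*i]);
        float32x4_t r = vmulq_f32(row0, in);
        r = vmlaq_f32(r, row1, in);
        r = vmlaq_f32(r, row2, in);
        r = vmlaq_f32(r, row3, in);
        vst1q_f32(&output[4*i], r);
    }
}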

Your output parameter glm::vec4 & output could be a reference to the same memory as your input, since they have the same type. Whenever you write to the output, the compiler must assume you could be changing the input, so it loads the input again from memory.

This is a consequence of the C/C++ pointer aliasing rules.

You can promise the compiler that the memory pointed to by output will never be accessed through any other pointer (or reference, in this case) with the restrict keyword:

void transform (
   glm::mat4 const & matrix,
   glm::vec4 const & input,
   glm::vec4 & __restrict output)

Then the extra loads disappear. Here is the compiler output (godbolt) (try removing the __restrict).
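For completeness, a sketch of the original function with the annotation applied (note that __restrict is a compiler extension, since standard C++ has no restrict keyword; GCC and clang both accept it on references):

#include <glm/glm.hpp>
#include <arm_neon.h>

void transform ( glm::mat4 const & matrix, glm::vec4 const & input,
                 glm::vec4 & __restrict output )
{
   // With __restrict, the compiler may assume that stores through `output`
   // never touch `input` or `matrix`, so input_local can stay in a register.
   float32x4_t const & input_local  = reinterpret_cast < float32x4_t const & > ( input[ 0 ] );
   float32x4_t &       result_local = reinterpret_cast < float32x4_t & > ( output[ 0 ] );

   result_local = vmulq_f32 (               reinterpret_cast < float32x4_t const & > ( matrix[ 0 ] ), input_local );
   result_local = vmlaq_f32 ( result_local, reinterpret_cast < float32x4_t const & > ( matrix[ 1 ] ), input_local );
   result_local = vmlaq_f32 ( result_local, reinterpret_cast < float32x4_t const & > ( matrix[ 2 ] ), input_local );
   result_local = vmlaq_f32 ( result_local, reinterpret_cast < float32x4_t const & > ( matrix[ 3 ] ), input_local );
}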

maxy