Why loop unroll brings so much speedup on ARM Cortex-a53?

Question

I'm playing around with loop unroll with the following code on a ARM Cortex-a53 processor running in AArch64 state:

void do_something(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
  for (int i = 0; i < array_size; i++)
  {
    a[i] = a[i] + b[i];
    c[i] = a[i] * 2;
  }
}

With flag -O1, I got the following assembly,

.L3:
    ldrh    w3, [x0, x4]
    ldrh    w5, [x1, x4]
    add     w3, w3, w5
    and     w3, w3, 65535
    strh    w3, [x0, x4]
    ubfiz   w3, w3, 1, 15
    strh    w3, [x2, x4]
.LVL2:
    add x4, x4, 2
.LVL3:
    cmp x4, x6
    bne .L3

which finished in 162ms (the size of a, b, c are big). For simplicity I left out some prolog and epilog codes before the loop, but they are just for stack setup, etc.

Then I unrolled the loop which results in code like the following:

void add1_opt1(uint16_t* a, uint16_t* b, uint16_t* c, size_t array_size)
{
  for (int i = 0; i < array_size/4; i+=4)
  {
    a[i]   = a[i] + b[i];
    c[i]   = a[i] * 2;
    a[i+1] = a[i+1] + b[i+1];
    c[i+1] = a[i+1] * 2;
    a[i+2] = a[i+2] + b[i+2];
    c[i+2] = a[i+2] * 2;
    a[i+3] = a[i+3] + b[i+3];
    c[i+3] = a[i+3] * 2;
  }
}

which gives assembly like the following (still with -O1, since with -O0 the compiler was doing something kind of stupid):

.L7:
    ldrh    w1, [x0]
    ldrh    w5, [x3]
    add w1, w1, w5
    and w1, w1, 65535
    strh    w1, [x0]
    ubfiz   w1, w1, 1, 15
    strh    w1, [x2]
    ldrh    w1, [x0, 2]
    ldrh    w5, [x3, 2]
    add w1, w1, w5
    and w1, w1, 65535
    strh    w1, [x0, 2]
    ubfiz   w1, w1, 1, 15
    strh    w1, [x2, 2]
    ldrh    w1, [x0, 4]
    ldrh    w5, [x3, 4]
    add w1, w1, w5
    and w1, w1, 65535
    strh    w1, [x0, 4]
    ubfiz   w1, w1, 1, 15
    strh    w1, [x2, 4]
    ldrh    w1, [x0, 6]
    ldrh    w5, [x3, 6]
    add w1, w1, w5
    and w1, w1, 65535
    strh    w1, [x0, 6]
    ubfiz   w1, w1, 1, 15
    strh    w1, [x2, 6]
.LVL8:
    add x4, x4, 4
.LVL9:
    add x0, x0, 8
    add x3, x3, 8
    add x2, x2, 8
    cmp x4, x6
    bcc .L7

which was almost like copying and pasting the other assembly code 4 times. And the question is, why this piece of code only took 28ms to run, which was like 5x speed up. With simple loop condition like this, I assumed the branch prediction should do a pretty good job in both codes right? And in the second assembly code, the stores were also interleaved. So I cannot imagine how such code can get that much speedup.

How did you measure the performance of the code? Please provide the full benchmark harness if possible. — fuz, Dec 26 '20 at 01:20
After timing the first code, I replace the function call with the second one and rebuild everything and run again. I assume there shouldn’t be any warm cache issue? If I reverse the run order the results are similar. — Da Teng, Dec 26 '20 at 01:27
That still doesn't explain how exactly you benchmark this code. To reproduce your results, I need your full benchmarking code so I can compile and run it on my own computer and run anaylses. This is called making a [mcve]. Keep this in mind for your next question! — fuz, Dec 26 '20 at 01:35
Why only `-O1`, instead of normal optimization with `-O2` or full optimization with `-O3` (with auto-vectorization which should help ere)? (And yes, of course [`-O0` is terrible, it's for fully consistent debugging](https://stackoverflow.com/q/53366394), useless for performance.) BTW, `gcc -O3 -funroll-loops` can even unroll loops for you. (Also enabled as part of `-fprofile-use`). — Peter Cordes, Dec 26 '20 at 02:04
I wouldn't recommend unrolling every loop, as it might pollute the I-Cache unnecessarily, instead use `#pragma GCC unroll ` only where needed. Always use the latest GCC when compiling for ARM, as a lot of effort has been put into optimizing ARM codegen in the last few years and specify the arch correctly (e.g. `-O3 -march=armv8-a+simd+sb+predres`) — rustyx, Dec 26 '20 at 22:16

score 2 · Accepted Answer · answered Dec 26 '20 at 01:23

2

The problem is here: for (int i = 0; i < array_size/4; i+=4).

Looping until array_size/4 will do a quarter of the work.

It should have been for (int i = 0; i < array_size; i+=4).

Then you should see a more explainable speedup of a few percent.

answered Dec 26 '20 at 01:23

rustyx

80,671
25
200
267

1

Aah.. you are right. It’s a stupid question. Never write code late night for me again is the lesson I learned. – Da Teng Dec 26 '20 at 01:31
@DaTeng: If ARM has them, hardware performance counters (for an event like L1d cache loads) can sanity check that your code is doing the same amount of work, if you know how wide the loads are in different versions of your loop. Or just counts for branch instructions; that would have caught the fact you were doing 16x fewer branches, instead of 4x. – Peter Cordes Dec 26 '20 at 05:57

Why loop unroll brings so much speedup on ARM Cortex-a53?

1 Answers1