
I am implementing a single-stage biquad filter on an STM32H743 processor with an ARM Cortex-M7 core. I am compiling my code with GCC from the GNU Arm Embedded Toolchain.

I want to optimize the code as far as reasonably possible without hand-writing assembly - partly because I am genuinely tight on clock cycles, but also because I want to learn how to optimize C code on embedded platforms.

People often go on about how advanced modern compilers are and how you shouldn't try to outsmart them - and this may be true in most situations, but here I seem to have encountered a case where GCC doesn't optimize this quite as well as it could.

All posted benchmark results were obtained using the on-chip cycle counter, with -O3 optimization enabled in all cases.

Edit: all code can now also be found in this gist: https://gist.github.com/Jonarw/dcd832095919715c65cdf3f4241617c0

This is my code in unoptimized form:

// b0, b1, b2, a1, a2, x0, x1, x2, y0, y1, y2 are floats, declared and initialized beforehand
// signal is a pointer to a block of float samples in DTCM RAM
for (uint16_t i = 0; i < 4096; i++)
{
    x0 = signal[i];
    y0 = b0 * x0 + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
    signal[i] = y0;

    x2 = x1;
    x1 = x0;
    y2 = y1;
    y1 = y0;
}

This takes 94313 CPU cycles to execute according to my benchmark.

Loop unrolling appears to be a crucial strategy for increasing performance on ARM Cortex CPUs; the CMSIS library in particular uses it extensively. So I tried this:

uint16_t c = 0;
for (uint16_t i = 0; i < 4096 / 8; i++)
{
    for (uint8_t i2 = 0; i2 < 8; i2++)
    {
        x0 = signal[c];
        y0 = b0 * x0 + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
        signal[c++] = y0;

        x2 = x1;
        x1 = x0;
        y2 = y1;
        y1 = y0;
    }
}

This took the execution time down to 25533 cycles - almost a 4x improvement!

I assume the compiler completely unrolls the inner loop and thereby reduces the loop overhead. I verified this by removing the inner loop and repeating its body 8 times instead - which gave exactly the same cycle count.
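To illustrate why this helps: once the body is repeated, the compiler can rename the state variables between iterations instead of copying them. A minimal hand-renamed 2x version (my own sketch for illustration, not the code from the gist) would look like this:

for (uint16_t i = 0; i < 4096; i += 2)
{
    float xa = signal[i];
    float ya = b0 * xa + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
    signal[i] = ya;

    float xb = signal[i + 1];
    // the second iteration reads the renamed values directly...
    float yb = b0 * xb + b1 * xa + b2 * x1 + a1 * ya + a2 * y1;
    signal[i + 1] = yb;

    x2 = xa;    // ...so the state copies only happen once per pair
    x1 = xb;
    y2 = ya;
    y1 = yb;
}

With an 8x unroll, almost all of the x2 = x1-style copies can turn into pure register renaming.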

I did some research and discovered the #pragma GCC unroll n compiler directive, which to my understanding should do exactly the same thing. So I tried this:

uint16_t c = 0;
#pragma GCC unroll 8
for (uint16_t i = 0; i < 4096; i++)
{
    x0 = signal[c];
    y0 = b0 * x0 + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
    signal[c++] = y0;

    x2 = x1;
    x1 = x0;
    y2 = y1;
    y1 = y0;
}

But, to my disappointment, this took 97117 cycles - even longer than the original.

Here are my questions:

  1. Do my observations make sense? Did I make a mistake with my benchmarks?
  2. Why doesn't GCC realize that it could unroll my loop to increase performance?
  3. Why is the loop overhead so significant? As I understand it, the branch predictor should avoid exactly this?
  4. Why doesn't the #pragma GCC unroll 8 have the same effect as my manual loop unrolling?
  5. Is there a way (compiler flag, #pragma...) to optimize my code without manually unrolling the loop?

Edit: Thanks a lot for your input so far! I will try and answer some of your questions in the comments.

What types are the variables used? Would you be able to refactor it to a function?

Sorry, I really should have included that to begin with; somehow this information got lost while I was trying to keep the question concise.

All variables are floats. The three versions described above can be seen in this gist wrapped as functions: https://gist.github.com/Jonarw/dcd832095919715c65cdf3f4241617c0

How are you measuring cycles?

I use the DWT_CYCCNT register as outlined in this SO post: ARM M4 Instructions per Cycle (IPC) counters
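For reference, a minimal sketch of my measurement harness (standard CMSIS register names; Biquad() is a placeholder for the function under test):

CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk; // enable the DWT unit
DWT->CYCCNT = 0;                                // reset the cycle counter
DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;            // start counting

Biquad(signal);                                 // warm-up call to populate ICache

uint32_t start = DWT->CYCCNT;
Biquad(signal);                                 // measured call
uint32_t cycles = DWT->CYCCNT - start;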

You are using -mcpu=cortex-m7, right?

Yes, I am. I am using a makefile generated by STM32CubeMX. The full call to gcc looks like this:

arm-none-eabi-gcc -c -mcpu=cortex-m7 -mthumb -mfpu=fpv5-d16 -mfloat-abi=hard -DUSE_HAL_DRIVER -DSTM32H743xx -DARM_MATH_CM7 -D__FPU_PRESENT -ICore/Inc -IDrivers/STM32H7xx_HAL_Driver/Inc -IDrivers/STM32H7xx_HAL_Driver/Inc/Legacy -IDrivers/CMSIS/Device/ST/STM32H7xx/Include -IDrivers/CMSIS/Include -IDrivers/CMSIS/DSP/Include -O3 -Wall -fdata-sections -ffunction-sections -g -gdwarf-2 -MMD -MP -MF"build/Biquad.d" -Wa,-a,-ad,-alms=build/Biquad.lst Core/Src/Biquad.c -o build/Biquad.o

This is an ST part, so there is some form of cache in front of the flash that can mess with performance testing.

I tried to minimize the influence of caching by operating in DTCM RAM (which is not cached). I also call the function once before benchmarking, so the code should be fully present in the instruction cache - this makes a difference of a couple hundred cycles, nothing too crazy.

How fast are you running the chip, and are there flash wait states?

The chip runs at 480 MHz. Yes there are, but I don't think flash speed is really the deciding factor here, because of the instruction cache (see above).

Have you looked at the assembly code produced by the compiler?

I will try to obtain the compiler output and report back.
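(In case it helps others reproduce this: reusing the relevant flags from the makefile call above, plus the same -I and -D options, something like the following should dump the generated assembly, as suggested in the comments:

arm-none-eabi-gcc -S -O3 -mcpu=cortex-m7 -mthumb -mfpu=fpv5-d16 -mfloat-abi=hard Core/Src/Biquad.c -o build/Biquad.s
)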

You could also just try using -O3 -funroll-loops for the file containing this function, to see if that helps for this loop.

I have tried this, as suggested in another comment, via an attribute on the function. This slightly improved the performance, but not by a lot (see the gist linked above for details).
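For reference, this is roughly how I applied the attribute (a sketch; ProcessBiquad is a placeholder name, the exact code is in the gist):

__attribute__((optimize("-funroll-loops")))  // per-function unrolling hint
void ProcessBiquad(float *signal)
{
    // ... same loop body as above ...
}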

  • Have you looked at the assembly code produced by the compiler? `gcc -S ...` produces assembly files instead of object ones. `gcc -fdump-tree-all` also comes in handy - it dumps all the intermediate representations of the program after each stage of the compilation pipeline. – Hristo Iliev Oct 15 '20 at 12:25
  • What types are the variables used? Would you be able to refactor it to a function and post an [MCVE]? How are you measuring cycles? – KamilCuk Oct 15 '20 at 12:25
  • *how advanced modern compilers are and that you shouldn't try to outsmart them* - Indeed, they're complex pieces of machinery that usually make pretty good code, but rarely optimal, and often need help to know which code to optimize aggressively (especially when that costs more code size). GCC `-O3` defaults to never unrolling loops (except for fully peeling short ones). `-funroll-loops` is enabled as part of profile-guided optimization, which you should try using. (`-O3 -fprofile-generate` / run it / `-O3 -fprofile-use`.) – Peter Cordes Oct 15 '20 at 12:27
  • You could also just try using `-O3 -funroll-loops` for the file containing this function, to see if that helps for this loop. (You might not want that for your whole program, depending on available flash size or i-cache size...) You are using `-mcpu=cortex-m7`, right? Does your M7 have DSP extensions, SIMD within a 32-bit register? (Although that's hard to use with a serial dependency like this.) – Peter Cordes Oct 15 '20 at 12:30
  • Also, `gcc -fopt-info-loop-all` could provide some useful diagnostics. GCC uses cost-based models of which optimisation is worth it and which not and those may be off for certain hardware. – Hristo Iliev Oct 15 '20 at 12:32
  • *Why is the loop overhead so significant? As I understand, the branch predictor should avoid exactly this?* - there are still instructions to actually increment the loop counter and branch, which are only needed 1/8 as often when unrolling. More importantly, the `mov` instructions to copy registers for the `x2 = x1` logic and so on can go away, just remembering which C variable / value each register currently holds. e.g. Fibonacci can be `x+=y; y+=x;` (2 add instructions) or it can be one `add` and 2 `mov` without unrolling. – Peter Cordes Oct 15 '20 at 12:36
  • Need to see a complete example, and how exactly you measured time. Did you adjust the alignment? (Likely not, as this is C.) Branch prediction is not necessarily what people think it is, and it doesn't always help; you need to read the ARM docs on what specific features, if any, it has and whether you have enabled them, etc... – old_timer Oct 15 '20 at 14:45
  • This is an ST part, so there is some form of cache in front of the flash that can mess with performance testing. It is impressive how well it works - it is actually hard to get numbers that are not improved by it. But also try this code in SRAM, to separate the Cortex-M itself from the system. Also, do you have clock divisors interfering? How fast are you running the chip, and are there flash wait states? – old_timer Oct 15 '20 at 14:47
  • If you want to tune, you can't rely on the compiler. Yes, it is true many people think you can't beat the compiler, but the reality is you can, often, with hand-written asm. Normally, though, you strive not to unless you have to, and you ideally want to let the compiler go first, then tune that code if it isn't horrible. – old_timer Oct 15 '20 at 14:49
  • Have you tried 32-bit variables? At least as a benchmark - it may or may not dramatically improve execution speed. – old_timer Oct 15 '20 at 14:49
  • We are unable to see the compiler output in your question; better, provide enough code so we can re-compile it ourselves and see the output (and compare it to yours, of course, when you post it) and then tweak it ourselves. Or even run it, for those of us who have the same or similar chips... – old_timer Oct 15 '20 at 14:51
  • Which compiler version is this? – old_timer Oct 15 '20 at 14:51
  • [doesn't look too bad](https://godbolt.org/z/c9q6aT) – EOF Oct 15 '20 at 16:30
  • Hmm, that's strange: for x86-64, `__attribute__((optimize("-funroll-loops")))` leads to an unroll by 4x, with lots of `movaps` for each iteration. But with `-funroll-loops` on the command line, it unrolls by 5 (`add rdi, 20`) and optimizes away lots of reg copies. (https://godbolt.org/z/854oej). So the attribute isn't behaving the same. – Peter Cordes Oct 16 '20 at 06:23
  • ARM looks similar: the attribute version has lots of `vmov.f32` in the output, vs. `-funroll-loops` having mostly `vfma.f32` (unrolling by 7, with `adds r0, r0, #28`). https://godbolt.org/z/oeP756. I used g++ 9.2.1 `-O3 -Wall -mcpu=cortex-m7 -mfloat-abi=softfp`. Without the float-abi option to tell it to use hardware FP, every FP operation calls a library function. – Peter Cordes Oct 16 '20 at 06:27

0 Answers