I am implementing a single-stage biquad filter on an STM32H743 processor with an ARM Cortex-M7 core. I am using the GCC compiler from the ARM Embedded Toolchain to compile my code.
I want to optimize the code as far as is reasonably possible without actually hand-writing assembly - partly because I am genuinely tight on clock cycles, but also because I want to learn how to optimize C code on embedded platforms.
People often say that modern compilers are very advanced and that you shouldn't try to outsmart them - and this may be true in most situations, but here I seem to have encountered a case where GCC doesn't seem to be able to optimize the code as well as it could.
All posted benchmark results were obtained using the on-chip cycle counter, with -O3 optimization enabled in all cases.
--edit: all code can now also be found in this gist: https://gist.github.com/Jonarw/dcd832095919715c65cdf3f4241617c0
This is my code in unoptimized form:
// b0, b1, b2, a1, a2, x1, x2, y1, y2 are initialized beforehand
// signal is a pointer to a block of memory in DTCM RAM
for (uint16_t i = 0; i < 4096; i++)
{
x0 = signal[i];
y0 = b0 * x0 + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
signal[i] = y0;
x2 = x1;
x1 = x0;
y2 = y1;
y1 = y0;
}
This takes 94313 CPU cycles to execute according to my benchmark.
Loop unrolling appears to be a crucial strategy for increasing performance on ARM Cortex CPUs; the CMSIS library in particular uses it extensively. So I tried this:
uint16_t c = 0;
for (uint16_t i = 0; i < 4096 / 8; i++)
{
for (uint8_t i2 = 0; i2 < 8; i2++)
{
x0 = signal[c];
y0 = b0 * x0 + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
signal[c++] = y0;
x2 = x1;
x1 = x0;
y2 = y1;
y1 = y0;
}
}
This took the execution time down to 25533 cycles - an almost 4x improvement!
I assume what is happening is that the compiler completely unrolls the inner loop, thereby reducing the overhead introduced by the loop. I verified this by removing the inner loop and repeating its body 8 times instead - which gave the exact same cycle count.
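Part of the win from full unrolling is likely not just the removed branch overhead, but that the compiler can rename the delay-line variables across iterations instead of emitting the x2 = x1 / y2 = y1 register moves for every sample. A minimal sketch of doing that renaming by hand (the function name and signature are my own, not from the question), unrolling by 2 so the state rotation happens only once per pair of samples:

```c
#include <stddef.h>

// Hypothetical sketch: process two samples per iteration. Within the pair,
// the second sample reads the first sample's values directly, so no
// per-sample "shift the delay line" moves are needed.
// n is assumed to be even; the filter uses the same sign convention as in
// the question (a1, a2 are added, not subtracted).
static void biquad_unrolled2(float *signal, size_t n,
                             float b0, float b1, float b2,
                             float a1, float a2,
                             float *state /* x1, x2, y1, y2 */)
{
    float x1 = state[0], x2 = state[1], y1 = state[2], y2 = state[3];
    for (size_t i = 0; i < n; i += 2) {
        float xa = signal[i];
        float ya = b0 * xa + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
        signal[i] = ya;

        float xb = signal[i + 1];
        // The roles of the state variables are rotated by renaming,
        // not by copying: xa is now "x1", x1 is now "x2", and so on.
        float yb = b0 * xb + b1 * xa + b2 * x1 + a1 * ya + a2 * y1;
        signal[i + 1] = yb;

        // Commit the rotated state once per pair of samples.
        x2 = xa; x1 = xb;
        y2 = ya; y1 = yb;
    }
    state[0] = x1; state[1] = x2; state[2] = y1; state[3] = y2;
}
```

The arithmetic is performed in the same order as in the original loop, so the output is identical; only the data movement between iterations disappears.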
I did some research and discovered the #pragma GCC unroll n compiler directive, which to my understanding should do exactly the same thing. So I tried this:
uint16_t c = 0;
#pragma GCC unroll 8
for (uint16_t i = 0; i < 4096; i++)
{
x0 = signal[c];
y0 = b0 * x0 + b1 * x1 + b2 * x2 + a1 * y1 + a2 * y2;
signal[c++] = y0;
x2 = x1;
x1 = x0;
y2 = y1;
y1 = y0;
}
But, to my disappointment, this took 97117 cycles - even longer than the original.
Here are my questions:
- Do my observations make sense? Did I make a mistake with my benchmarks?
- Why doesn't GCC realize that it could unroll my loop to increase performance?
- Why is the loop overhead so significant? As I understand it, the branch predictor should avoid exactly this?
- Why doesn't the #pragma GCC unroll 8 directive have the same effect as my manual loop unrolling?
- Is there a way (compiler flag, #pragma...) to optimize my code without manually unrolling the loop?
Edit: Thanks a lot for your input so far! I will try and answer some of your questions in the comments.
What types are the variables used? Would you be able to refactor it to a function?
Sorry I really should have included that to begin with. But somehow while trying to keep the question concise this information got lost.
All variables are floats. The three versions described above can be seen in this gist wrapped as functions: https://gist.github.com/Jonarw/dcd832095919715c65cdf3f4241617c0
How are you measuring cycles?
I use the DWT_CYCCNT register as outlined in this SO post: ARM M4 Instructions per Cycle (IPC) counters
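For completeness, the measurement looks roughly like this (CMSIS register names; the LAR unlock via a raw address is needed on Cortex-M7 parts - treat this as a target-only sketch, it obviously won't run on a host PC):

```c
#include <stdint.h>
#include "stm32h7xx.h"  // pulls in the CMSIS core definitions (DWT, CoreDebug)

static inline void cycle_counter_start(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the DWT block
    *(volatile uint32_t *)0xE0001FB0 = 0xC5ACCE55;   // DWT LAR unlock on M7
    DWT->CYCCNT = 0;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start the cycle counter
}

// Usage (run_biquad is a stand-in for the function under test):
// cycle_counter_start();
// uint32_t start = DWT->CYCCNT;
// run_biquad(signal);
// uint32_t cycles = DWT->CYCCNT - start;
```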
You are using -mcpu=cortex-m7, right?
Yes, I am. I am using a makefile generated by STM32CubeMX. The full call to gcc looks like this:
arm-none-eabi-gcc -c -mcpu=cortex-m7 -mthumb -mfpu=fpv5-d16 -mfloat-abi=hard -DUSE_HAL_DRIVER -DSTM32H743xx -DARM_MATH_CM7 -D__FPU_PRESENT -ICore/Inc -IDrivers/STM32H7xx_HAL_Driver/Inc -IDrivers/STM32H7xx_HAL_Driver/Inc/Legacy -IDrivers/CMSIS/Device/ST/STM32H7xx/Include -IDrivers/CMSIS/Include -IDrivers/CMSIS/DSP/Include -O3 -Wall -fdata-sections -ffunction-sections -g -gdwarf-2 -MMD -MP -MF"build/Biquad.d" -Wa,-a,-ad,-alms=build/Biquad.lst Core/Src/Biquad.c -o build/Biquad.o
This is an ST part, so there is some form of cache in front of the flash that can interfere with performance testing.
I tried to minimize the influence of caching by operating in DTCM RAM (which is not cached). I also call the function once before benchmarking, so the code should be completely cached in ICache - this makes a difference of a couple hundred cycles, nothing too crazy.
how fast are you running the chip and are there flash wait states?
The chip runs at 480 MHz. Yes, there are wait states, but I don't think flash speed is really the deciding factor, because of the ICache (see above).
Have you looked at the assembly code produced by the compiler?
I will try to obtain the compiler output and report back.
You could also just try using -O3 -funroll-loops for the file containing this function, to see if that helps for this loop.
I have tried this, as suggested in another comment, via an attribute on the function. It slightly improved performance, but not by much (see the gist linked above for details).
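For reference, the per-function attribute mentioned here can be written as below. This is a sketch - the function body is a simple stand-in I made up, not the biquad from the question - showing GCC's optimize attribute applied to a single function so the rest of the file keeps its command-line flags:

```c
#include <stddef.h>

// GCC-specific: request -funroll-loops for just this function,
// in addition to whatever optimization level the file is built with.
__attribute__((optimize("unroll-loops")))
static float sum_squares(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * x[i];  // candidate loop for the requested unrolling
    return acc;
}
```

Note that GCC documents the optimize attribute as intended mainly for debugging, so `-funroll-loops` on the whole file (or `#pragma GCC optimize`) may be the more reliable route.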