Examples of 'falign-loops' optimisation occurring?

Question

One of the optimisation options in gcc is -falign-loops.

Only a vague description is provided here: https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/compiler-options/compiler-option-details/data-options/falign-loops-qalign-loops.html

It is listed as one of the optimisations enabled by the -O2 flag here: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

I have been unable to see it in action with any piece of code I have tried in Compiler Explorer. Does anyone know how the flag works, and perhaps have some explicit examples?

Thanks

In gcc's doc, "If n is not specified or is zero, use a machine-dependent default which is very likely to be ‘1’, meaning no alignment. The maximum allowed n option value is 65536." I believe `-O2` uses default value which is probably `1`. Please [edit] to provide your test code on godbolt. — Louis Go, May 09 '22 at 07:23
[Could you use C inline assembly to align instructions? (without Compiler optimizations)](https://stackoverflow.com/q/72005117) shows some real examples of GCC asm output, including one where `-falign-loops=16` produced a `.p2align 4,,10` / `.p2align 3` before a loop. (Note that Godbolt filters directives by default; perhaps that was your problem? Either way, possible duplicate, unless Louis's comment is the key instead.) — Peter Cordes, May 09 '22 at 08:17

score 1 · Answer 1 · answered Feb 24 '23 at 21:37

Over-aligning loops just wastes NOP cycles and may pollute caches. However, under-aligning loops, on some small cores like ARM Cortex-M4, will cost an additional cycle. Here's a simple example:

unsigned summer( unsigned accum, int * ptr, unsigned N )
{
    for(unsigned i = 0; i < N; ++i )
    {
        accum += *ptr++;
    }
    return accum;
}

Here's the toolchain I used:

rsaxvc@toolbox:~$ arm-linux-gnueabihf-gcc-8 --version
arm-linux-gnueabihf-gcc-8 (Ubuntu/Linaro 8.4.0-3ubuntu1) 8.4.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

When I build with defaults, the top of the loop (loc_6) lands at address 0x6, which is not 32-bit aligned. However, the instruction at the top of the loop (LDR.W with post-increment) is a 32-bit instruction, so branching back to it costs an additional cycle because the branch target is a misaligned 32-bit instruction. I used the IDA disassembler to view the result:

$ arm-linux-gnueabihf-gcc-8 -O3 -mcpu=cortex-m4 test.c -c

.text:00000000 ; Processor       : ARM
.text:00000000 ; ARM architecture: ARMv7E-M
.text:00000000 ; Target assembler: Generic assembler for ARM
.text:00000000 ; Byte sex        : Little endian
.text:00000000
.text:00000000 ; ===========================================================================
.text:00000000 ; Segment type: Pure code
.text:00000000                 AREA .text, CODE
.text:00000000                 CODE16
.text:00000000 ; =============== S U B R O U T I N E =======================================
.text:00000000                 EXPORT summer
.text:00000000 summer
.text:00000000
.text:00000000 var_4           = -4
.text:00000000
.text:00000000                 CBZ     R2, locret_18
.text:00000002                 PUSH    {R4}
.text:00000004                 MOVS    R3, #0
.text:00000006
.text:00000006 loc_6                                   ; CODE XREF: summer+10j
.text:00000006                 LDR.W   R4, [R1],#4
.text:0000000A                 ADDS    R3, #1
.text:0000000C                 CMP     R2, R3
.text:0000000E                 ADD     R0, R4
.text:00000010                 BNE     loc_6
.text:00000012                 LDR.W   R4, [SP+4+var_4],#4
.text:00000016                 BX      LR
.text:00000018 ; ---------------------------------------------------------------------------
.text:00000018
.text:00000018 locret_18                               ; CODE XREF: summerj
.text:00000018                 BX      LR
.text:00000018 ; End of function summer

When I build with -falign-loops=4, a NOP is inserted just before the top of the loop (loc_8), which moves it to 0x8, so it is now 32-bit aligned. Since the 32-bit instruction at the top of the loop is now suitably aligned, there's no additional penalty cycle. However, there's now a NOP that always executes on the way into the loop.

$ arm-linux-gnueabihf-gcc-8 -O3 -mcpu=cortex-m4 test.c -c -falign-loops=4

.text:00000000 ; Processor       : ARM
.text:00000000 ; ARM architecture: ARMv7E-M
.text:00000000 ; Target assembler: Generic assembler for ARM
.text:00000000 ; Byte sex        : Little endian
.text:00000000 ; ===========================================================================
.text:00000000 ; Segment type: Pure code
.text:00000000                 AREA .text, CODE
.text:00000000                 CODE16
.text:00000000 ; =============== S U B R O U T I N E =======================================
.text:00000000                 EXPORT summer
.text:00000000 summer
.text:00000000
.text:00000000 var_4           = -4
.text:00000000
.text:00000000                 CBZ     R2, locret_1A
.text:00000002                 PUSH    {R4}
.text:00000004                 MOVS    R3, #0
.text:00000006                 NOP
.text:00000008
.text:00000008 loc_8                                   ; CODE XREF: summer+12j
.text:00000008                 LDR.W   R4, [R1],#4
.text:0000000C                 ADDS    R3, #1
.text:0000000E                 CMP     R2, R3
.text:00000010                 ADD     R0, R4
.text:00000012                 BNE     loc_8
.text:00000014                 LDR.W   R4, [SP+4+var_4],#4
.text:00000018                 BX      LR
.text:0000001A ; ---------------------------------------------------------------------------
.text:0000001A
.text:0000001A locret_1A                               ; CODE XREF: summerj
.text:0000001A                 BX      LR
.text:0000001A ; End of function summer

This trade-off becomes a bigger issue with nested loops where the counts aren't known at compile-time. The LLVM people discuss these instruction alignment issues here: https://reviews.llvm.org/D51780 . It's most apparent with very short loops like this one. I ran into the Cortex-M4 taking an extra cycle to branch to a misaligned loop while optimizing a simple blit that was only a few instructions longer.
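To make the trade-off concrete, here is a rough cost model in C. It is only a sketch under the assumptions stated in this answer: one extra cycle for each taken branch back to a misaligned 32-bit loop top, and one cycle for each executed alignment NOP. The function names and the parameters `entries` (how many times the loop is entered) and `iters` (iterations per entry) are hypothetical, and a real core may behave differently.

```c
/* Extra cycles when the loop top is NOT aligned.
   Assumption: one penalty cycle per taken back-edge branch, and the last
   iteration of each entry falls through, so there are (iters - 1) taken
   branches per entry. */
unsigned long penalty_unaligned(unsigned long entries, unsigned long iters)
{
    return iters > 0 ? entries * (iters - 1) : 0;
}

/* Extra cycles when a single alignment NOP precedes the loop.
   Assumption: the NOP costs one cycle and runs once per loop entry. */
unsigned long penalty_aligned(unsigned long entries)
{
    return entries;
}
```

Under these assumptions the NOP loses when an inner loop runs only once per entry (1 cycle against 0), breaks even at two iterations, and wins beyond that. For example, 10 entries of 100 iterations each give 990 penalty cycles unaligned against 10 aligned.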

rsaxvc · Answer 2 · 2023-03-10T04:45:14.237

When I build this example function:

unsigned summer( unsigned accum, int * ptr, unsigned N )
{
    for(unsigned i = 0; i < N; ++i )
    {
        accum += *ptr++;
    }
    return accum;
}

With compiler-explorer's ARM gcc 8.5(linux), CFLAGS="-O3 -Wall -Wextra -mcpu=cortex-m4 -falign-loops=4", at first I don't see evidence of loop alignment:

summer(unsigned int, int*, unsigned int):
    cbz     r2, .L9
    push    {r4}
    movs    r3, #0
.L3:
    ldr     r4, [r1], #4
    adds    r3, r3, #1
    cmp     r2, r3
    add     r0, r0, r4
    bne     .L3
    pop     {r4}
    bx      lr
.L9:
    bx      lr

After unchecking "Filter->Directives" I see a lot more. Here's just the function, with unrelated directives removed by hand:

summer(unsigned int, int*, unsigned int):
    cbz     r2, .L9
    push    {r4}
    movs    r3, #0
.LVL1:
    .p2align 2 @ align the next instruction to 2^2 = 4 bytes (.p2align takes a power of two)
.L3:
    ldr     r4, [r1], #4
    adds    r3, r3, #1
    cmp     r2, r3
    add     r0, r0, r4
    bne     .L3
    pop     {r4}
    bx      lr
.L9:
    bx      lr

But we don't really see the effect of .p2align yet. After re-enabling "Filter->Directives" and also checking "Output->Compile to binary object", we see the NOP that -falign-loops=4 adds:

summer(unsigned int, int*, unsigned int):
    cbz r2, 18 <summer(unsigned int, int*, unsigned int)+0x18>
    push    {r4}
    movs    r3, #0
    nop
    ldr.w   r4, [r1], #4
    adds    r3, #1
    cmp r2, r3
    add r0, r4
    bne.n   8 <summer(unsigned int, int*, unsigned int)+0x8>
    pop {r4}
    bx  lr
    bx  lr
    nop
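The padding can also be checked by hand. The following is a minimal sketch of the arithmetic behind `.p2align n`; the helper name `p2align_padding` is hypothetical and is not part of GCC or binutils.

```c
#include <stdint.h>

/* Number of padding bytes needed at address `addr` so that the next
   instruction starts on a multiple of 2^n, as requested by `.p2align n`. */
uint32_t p2align_padding(uint32_t addr, unsigned n)
{
    uint32_t align = (uint32_t)1 << n;          /* .p2align 2 -> 4 bytes */
    return (align - (addr & (align - 1))) & (align - 1);
}
```

Without alignment the loop label sits at 0x6, so `p2align_padding(0x6, 2)` gives 2 bytes, which matches the single 16-bit NOP above and puts the loop at 0x8. An address that is already aligned, such as 0x8, needs no padding.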

Now that we see what it is, could we improve it? Perhaps some cores would prefer that we combine "movs r3, #0" and "nop" into a single 32-bit wide instruction, "movs.w r3,#0". As it stands, the NOP costs us only once per function call, whereas the misaligned 32-bit instruction penalty would apply on every loop iteration.

If you don't use "compile to binary", you have to untick the "filter directives" option on Godbolt so you can see the `.p2align` directive GCC emits. Of course you can't see the actual addresses of instructions without compiling to a binary, but you can see the effect of `-falign-loops=4` without it. — Peter Cordes, Feb 24 '23 at 22:30
Thanks @PeterCordes, I saw that button but thought it was filter-in not filter-out. I've updated the answer to show how each setting impacts the output. — rsaxvc, Mar 09 '23 at 19:18
