GCC generates different code depending on array index value

Question

This code (arm):

void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008646B] ^= 1;
        sys.Delay_ms(14);
    }
}

...is compiled to the following assembly:

08000470:   ldr r4, [pc, #20]       ; (0x8000488 <blinkRed()+24>) // r4 = 0x422191ac
08000472:   ldr r6, [pc, #24]       ; (0x800048c <blinkRed()+28>)
08000474:   movs r5, #14
08000476:   ldr r3, [r4, #0]
08000478:   eor.w r3, r3, #1
0800047c:   str r3, [r4, #0]
0800047e:   mov r0, r6
08000480:   mov r1, r5
08000482:   bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
08000486:   b.n 0x8000476 <blinkRed()+6>

It is ok.

But if I change only the array index (by -0x400)...

void blinkRed(void)
{
    for(;;)
    {
        bb[0x0008606B] ^= 1;
        sys.Delay_ms(14);
    }
}

...I get less optimized code:

08000470:   ldr r4, [pc, #24]       ; (0x800048c <blinkRed()+28>) // r4 = 0x42218000
08000472:   ldr r6, [pc, #28]       ; (0x8000490 <blinkRed()+32>)
08000474:   movs r5, #14
08000476:   ldr.w r3, [r4, #428]    ; 0x1ac
0800047a:   eor.w r3, r3, #1
0800047e:   str.w r3, [r4, #428]    ; 0x1ac
08000482:   mov r0, r6
08000484:   mov r1, r5
08000486:   bl 0x80001ac <CSTM32F100C6::Delay_ms(unsigned int)>
0800048a:   b.n 0x8000476 <blinkRed()+6>

The difference is that in the first case r4 is loaded directly with the target address (0x422191ac), and memory is then accessed with 2-byte instructions. In the second case r4 is loaded with an intermediate address (0x42218000), and memory is then accessed with 4-byte instructions that add an offset (+0x1ac) to reach the target address (0x422181ac).
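The addresses quoted above can be double-checked with a few lines of C++. This is only a minimal sketch, assuming bb starts at 0x42000000 (as the linker script further down places it) and that each element is a 32-bit word; the names kBitBandAliasBase and bbElementAddress are made up for this illustration and are not part of the project.

```cpp
#include <cstdint>

// Assumption: bb is located at the start of the bit-band alias region,
// as the BITBAND memory region in the linker script describes.
constexpr std::uint32_t kBitBandAliasBase = 0x42000000u;

// Hypothetical helper: address of bb[index] for an array of 32-bit words.
constexpr std::uint32_t bbElementAddress(std::uint32_t index) {
    return kBitBandAliasBase + 4u * index;
}

// First case: the full target address is what ends up in r4.
static_assert(bbElementAddress(0x0008646Bu) == 0x422191ACu,
              "first case target address");

// Second case: the same kind of target address, but the generated code
// reaches it as an intermediate base in r4 plus an offset of 0x1ac (428).
static_assert(bbElementAddress(0x0008606Bu) == 0x422181ACu,
              "second case target address");
static_assert(bbElementAddress(0x0008606Bu) == 0x42218000u + 0x1ACu,
              "second case: base plus offset");
```

If the snippet compiles, the static_asserts hold, so both listings address the element the source code asks for and differ only in how the address is formed.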

Why does the compiler do this?

I use: arm-none-eabi-g++ -mcpu=cortex-m3 -mthumb -g2 -Wall -O1 -std=gnu++14 -fno-exceptions -fno-use-cxa-atexit -fstrict-volatile-bitfields -c -DSTM32F100C6T6B -DSTM32F10X_LD_VL

bb is:

__attribute__ ((section(".bitband"))) volatile u32 bb[0x00800000];

In the .ld file it is defined as follows. In the MEMORY section:

BITBAND(rwx): ORIGIN = 0x42000000, LENGTH = 0x02000000

In the SECTIONS section:

.bitband (NOLOAD) :
SUBALIGN(0x02000000)
{
    KEEP(*(.bitband))
} > BITBAND

Ummm... not sure if this is relevant, but... would the optimized version work as well? Maybe there are some preconditions for the faster load. This information would indicate whether it is the optimizer's problem, or architecture related. — luk32, May 10 '15 at 12:59
Both versions work. I cannot imagine any precondition that holds in one case and not in the other. I changed the index step by step and found that the second version is compiled if the array index is in the range from 0x00086020 to 0x000863FF. If the array index is outside this range, the first (optimized) version is compiled. — Woodoo, May 10 '15 at 13:32
As mentioned `-mcpu=cortex-m3`. It is `STM32F100C6`. `bb` is array of 32-bit words with fixed base address of 0x42000000. No alignment issues should exist. — Woodoo, May 10 '15 at 14:05
you have not posted enough, the latter (both) look like an optimization, I dont really see there being a performance difference from what you have shown. for us to know why the compiler did something you need to show all the relevant code, I suspect you have not based on what the compiler output. — old_timer, May 10 '15 at 20:44
The difference is in code size: the second version uses a 32-bit instruction where the first uses a 16-bit one. The C++ code shown is a task. There is no other relevant code, except the task starting code `TaskRun(blinkRed);` in main. I based my conclusions on step-by-step debugging in the disassembly, not on compiler output. — Woodoo, May 11 '15 at 05:15
Are you asking the difference between `08000476: ldr r3, [r4, #0]` vs `08000476: ldr.w r3, [r4, #428]` ? As in why one is a 16bit and other one is 32bit instruction? In that case its just one instruction needs to save #428 somewhere... — auselen, May 11 '15 at 09:21
No, @auselen, I know difference between these instructions. I am asking why compiler uses `ldr r3, [r4, #0]` in first case and `ldr.w r3, [r4, #428]` in second case, while there is no difference in source code. — Woodoo, May 11 '15 at 09:40
Because -Os causes instruction sequence optimization even for volatile access. For example if I want to access DMA registers I must turn DMA on with `io.rcc.peripheral.enable.ahb.dma1 = true;` command. Then I access registers with `io.dma1.channel[0].config.msize = 0`. Unfortunately for read-modification-write instructions compiler performs all reads, then all modifications and finally all writes. So, it causes reading of DMA registers, while DMA is still off. — Woodoo, May 11 '15 at 10:13
One possible reason is that the second address is such that using the offset allows some other piece of code to share the same literal base address, thus saving 4 (or more) bytes elsewhere, but you'd need to look at the _entire_ disassembly for that sort of thing. Another possible reason is the unhelpful, but not all that uncommon, "weird corner cases in GCC's code generator" one. — Notlikethat, May 12 '15 at 08:29
Concerning -Os, You could use a barrier to force the compiler to "commit" its code=> asm volatile("" : : : "memory"); <= That way, no reordering will happen. — xryl669, May 12 '15 at 13:10
@Notlikethat, I thought about it, but there is no other piece of code that uses an address near the one my code uses. Moreover, I added a new piece of code that uses the same array but the next element, which means +4 in the offset... I was very surprised that for the new piece of code the compiler did not share the already existing base address, but created a new one... — Woodoo, May 12 '15 at 16:06
@xryl669, yes, I saw this here [link](http://stackoverflow.com/questions/22106843/gccs-reordering-of-read-write-instructions). But it also says there: "The C language rules are such that GCC is forbidden from reordering volatile loads and store memory accesses with respect to each other, or deleting them." I think this also applies to C++. I don't want to add strange code like `asm volatile("" : : : "memory");` to my program if I already use volatile. But thanks for the advice. — Woodoo, May 12 '15 at 16:28

score 1 · Answer 1 · answered May 31 '15 at 09:32

I would consider it an artifact of -O1, that is, a missed optimization opportunity.

It can be understood in more detail if we look at the code generated with -O0 to load bb[...]:

First case:

movw    r2, #:lower16:bb
movt    r2, #:upper16:bb
movw    r3, #37292
movt    r3, 33
adds    r3, r2, r3
ldr r3, [r3, #0]

Second case:

movw    r3, #:lower16:bb
movt    r3, #:upper16:bb
add r3, r3, #2195456       ; 0x218000    = 4*0x86000
add r3, r3, #428
ldr r3, [r3, #0]

The code in the second case is better, and it is possible because the constant offset can be added with two add instructions (which is not the case if the index is 0x0008646B).

-O1 only performs optimizations that are not time-consuming. So apparently it merges the add into the ldr early, and therefore later misses the opportunity to load the whole address with a single PC-relative ldr.

Compile with -O2 (or -fgcse) and the code looks as expected.
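To see why one offset splits into two adds and the other does not, here is a rough sketch. It is a simplified model and not GCC's actual decision logic: it assumes the byte offset is split into a 4 KiB-aligned high part plus its low 12 bits, and that the high part must be a Thumb-2 "modified immediate" (an 8-bit value shifted left, or one of the byte-replicated patterns). The function names are made up for this illustration.

```cpp
#include <cstdint>

// True if v can be encoded as a Thumb-2 modified immediate constant:
// an 8-bit value shifted left within 32 bits, or a replicated byte pattern.
constexpr bool isThumb2ModifiedImm(std::uint32_t v) {
    if (v == 0) return true;
    std::uint32_t t = v;
    while ((t & 1u) == 0) t >>= 1;       // drop trailing zero bits
    if (t <= 0xFFu) return true;         // all set bits fit in an 8-bit window
    const std::uint32_t lo = v & 0xFFu;
    const std::uint32_t hi = (v >> 8) & 0xFFu;
    return v == (lo | (lo << 16))        // 0x00XY00XY
        || v == ((hi << 8) | (hi << 24)) // 0xXY00XY00
        || v == lo * 0x01010101u;        // 0xXYXYXYXY
}

// Simplified model (an assumption, see above): the low 12 bits always fit an
// immediate field, so only the 4 KiB-aligned high part decides the outcome.
constexpr bool offsetSplitsIntoTwoAdds(std::uint32_t index) {
    const std::uint32_t byteOffset = 4u * index;   // bb holds 32-bit words
    return isThumb2ModifiedImm(byteOffset & ~0xFFFu);
}

// Second case: 4 * 0x8606B = 0x2181AC = 0x218000 + 0x1AC, both encodable.
static_assert(isThumb2ModifiedImm(0x218000u), "0x218000 is 0x43 << 15");
static_assert(isThumb2ModifiedImm(0x1ACu), "0x1AC is 0x6B << 2");
static_assert(offsetSplitsIntoTwoAdds(0x0008606Bu), "second case splits");

// First case: 4 * 0x8646B = 0x2191AC; the high part 0x219000 needs 10 bits.
static_assert(!isThumb2ModifiedImm(0x219000u), "0x219000 is not encodable");
static_assert(!offsetSplitsIntoTwoAdds(0x0008646Bu), "first case does not split");
```

Under this model the second index yields exactly the two constants seen in the add instructions above (2195456 and 428), while the first index has no such split and is built with movw/movt instead.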
