GCC does not optimise a struct copy of uninitialised static const

Question

First off, I am developing for a microcontroller, so RAM and ROM usage are priorities.

I realise this may read as a bug report or not specific enough. If I don't get any answers here I will file it as such.

I like using static const structs to initialise stack structures to defaults. In most cases the default struct is all zeros. I prefer to do this with static const structs rather than a memset (memset or struct assignment, static const assignment)

My current toolchain is arm-none-eabi-gcc-4_7_3, compiling for a Cortex M4 target with optimisation -Os.

I have noticed the following: GCC produces different code if I explicitly initialise my static const struct to zero than if I do not (static const struct foo; vs static const struct foo = {0};). In particular, it allocates the uninitialised static const structs in memory and performs copy operations from them.

Here is a code sample:

struct foo {int foo; int bar;};
struct bar {int bar[20];};

static const struct foo foo1_init, foo2_init = {0};
static const struct bar bar1_init, bar2_init = {0};

extern struct foo foo1, foo2;
extern struct bar bar1, bar2;

void init_foo1(void)
{
    foo1 = foo1_init;
}

void init_foo2(void)
{
    foo2 = foo2_init;
}

void init_bar1(void)
{
    bar1 = bar1_init;
}

void init_bar2(void)
{
    bar2 = bar2_init;
}

Compiled, this produces the following assembler listing (rearranged and trimmed for brevity):

 396                    .section    .bss.foo1_init,"aw",%nobits
 397                    .align  2
 398                    .set    .LANCHOR0,. + 0
 401                foo1_init:
 402 0000 00000000      .space  8
 402      00000000 

  40                .L2:
  41 0010 00000000      .word   .LANCHOR0
  42 0014 00000000      .word   foo1

  55:                    ****   foo1 = foo1_init;
  32                    .loc 1 55 0
  33 0000 034A          ldr r2, .L2
  34 0002 044B          ldr r3, .L2+4
  35 0004 92E80300      ldmia   r2, {r0, r1}
  36 0008 83E80300      stmia   r3, {r0, r1}


  67                .L5:
  68 000c 00000000      .word   foo2

  60:                    ****   foo2 = foo2_init;
  60 0000 024B          ldr r3, .L5
  61 0002 0022          movs    r2, #0
  62 0004 1A60          str r2, [r3, #0]
  63 0006 5A60          str r2, [r3, #4]


 389                    .section    .bss.bar1_init,"aw",%nobits
 390                    .align  2
 391                    .set    .LANCHOR1,. + 0
 394                bar1_init:
 395 0000 00000000      .space  80
 395      00000000 
 395      00000000 
 395      00000000 
 395      00000000 

  98                .L8:
  99 0010 00000000      .word   .LANCHOR1
 100 0014 00000000      .word   bar1

  65:                    ****   bar1 = bar1_init;
  89                    .loc 1 65 0
  90 0002 0349          ldr r1, .L8
  91 0004 0348          ldr r0, .L8+4
  92 0006 5022          movs    r2, #80
  93 0008 FFF7FEFF      bl  memcpy


 130                .L11:
 131 0010 00000000      .word   bar2

 70:                    ****    bar2 = bar2_init;
 121                    .loc 1 70 0
 122 0002 0021          movs    r1, #0
 123 0004 5022          movs    r2, #80
 124 0006 0248          ldr r0, .L11
 125 0008 FFF7FEFF      bl  memset

We can see that for foo2 = foo2_init and bar2 = bar2_init the compiler has optimised the copies down to storing zeros to foo2 directly, or calling memset for bar2.

We can see that for foo1 = foo1_init and bar1 = bar1_init the compiler is performing explicit copies: loading to and storing from registers for foo1, and calling memcpy for bar1.

I have a few questions:

1. Is this expected GCC operation? I would expect the uninitialised static const structs to follow the same path inside GCC as the initialised static const structs and so produce the same output.
2. Does this happen for other versions of ARM GCC? I do not have other versions to hand, and all online C to assembly compilers are in fact C++ compilers.
3. Does this happen for other target architectures of GCC? Again, I do not have other versions to hand.

Could you edit your code for consistency? It currently refers to `foo1_init` etc., which are not defined in your code (it defines `init_foo1` instead). I guess it's just a typo, as you have `init_foo1` as both a variable and a function in the same scope. — Ian Abbott, Feb 05 '16 at 13:20
A call to `memcpy()` is pretty cheap in terms of space, have you compared that to what it would cost to inline the copies? Perhaps there's a heuristic that emits the call when the number of bytes is large enough. — unwind, Feb 05 '16 at 13:31
@Ian Indeed a typo. I originally had a single function named something else but that made the assembly output difficult to comprehend. — Iain Rist, Feb 05 '16 at 13:31
1. I suspect it is because the uninitialized variables are only _tentatively_ defined, and the compiler is generating code that does not care whether the variable is fully defined or not. (I.e., it is not checking to see if the variable gets fully defined with an initializer later on in the translation unit.) — Ian Abbott, Feb 05 '16 at 13:33
@unwind But instantiating a zero initialised struct to then copy it later is a waste of space. — Iain Rist, Feb 05 '16 at 13:33
I agree with @IanAbbott, and if that is the case the compiler behaves correctly, because you defined `foo2_init` and `bar2_init` explicitly as constant and always == 0. So the correct optimization on copy is to zero the destination (using `memset`). On the other hand, `foo1_init` and `bar1_init` are constant but of unknown contents, so the compiler tries to preserve that content by copying it to the destination. P.S. **The compiler knows only what it has already translated; it does not care what is defined or initialized after the point of use.** — Frankie_C, Feb 05 '16 at 13:36
@unwind I was using _tentative_ as described in the C standard. — Ian Abbott, Feb 05 '16 at 13:37
@IainRist In the assembler output you posted, it is the uninitialized variables that are wasting space in BSS. The initialized variables seem to have been optimized out. — Ian Abbott, Feb 05 '16 at 13:40
It would be interesting to see whether a more aggressive optimization level would enable the compiler to tell that the uninitialized variable is never fully defined later on in the translation unit, and optimize it away in the same way as the initialized variables, but that would require an extra pass. — Ian Abbott, Feb 05 '16 at 13:44
@IanAbbott My understanding is the uninitialised variables are in BSS, indicated by the lines `.section .bss.foo1_init,"aw",%nobits` and `.section .bss.bar1_init,"aw",%nobits`. — Iain Rist, Feb 05 '16 at 13:45
@IanAbbott Optimisation level `-O3` produces essentially the same assembly — Iain Rist, Feb 05 '16 at 13:50
Sure it does. IIRC one of the newer gcc versions has an option to also put explicitly all-zero initialised static variables into the BSS. That aside, you really should use at least gcc 4.8, as that greatly enhances debugging capabilities on still-optimized code. And gcc 4.9 added an option to optimize for (slow) Flash. — too honest for this site, Feb 05 '16 at 13:55
There is also a third option to clear a structure to zeros using GNU C: `foo1 = (struct foo){ .foo = 0 };`. Naming (any one) of the fields to zero silences any warnings even at `-Wall -Wextra`, and it does not matter which field you name, the rest of the structure is initialized to zeros. On arm-none-eabi-gcc 4.9.3 `-Os`, this generates a `memset()` call for `struct bar`; `struct foo` is small enough to zero out using two `str` mnemonics. — Nominal Animal, Feb 05 '16 at 14:28
@NominalAnimal Compound literals are indeed an option, but not my preferred one, as I want to be able to change the struct initialiser values at a later date without having to search the entire codebase. — Iain Rist, Feb 05 '16 at 14:43
@IainRist: Then use a macro that expands to a compound literal, similar to e.g. `PTHREAD_MUTEX_INITIALIZER`. — Nominal Animal, Feb 05 '16 at 17:41
@IainRist - "@IanAbbott My understanding is the uninitialised variables are in BSS [...]". Yes, that's what I wrote, and that they were wasting space, unlike the "initialized to zero" variables, which, judging by the abridged assembly listing you posted, have been optimized out. Or do those still appear in some other section? — Ian Abbott, Feb 05 '16 at 17:44
@IanAbbott If you do provide definitions to complete the tentative definitions, at the end of the file, then the code *does* optimize based on those definitions. So I don't think that is sufficient explanation yet. It is almost as if gcc thinks there is some other way that the tentative definitions might be completed. That is true if they had external linkage (gcc implements the C11 J.5.11 extension), but for internal linkage objects there is no possible way that can happen. I would like to hear from gcc developers about this; I wonder if a bug report has been submitted. The behaviour is still present in gcc 7. — M.M, May 10 '17 at 07:10
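
The macro idea from Nominal Animal's comments can be sketched as follows. This is only an illustration: the macro names FOO_DEFAULTS and BAR_DEFAULTS are hypothetical, and foo1 and bar1 are defined locally (rather than declared extern as in the question) so that the snippet is self-contained. The defaults live in one place, and each assignment uses a C99 compound literal, so no separate static const object needs to exist.

```c
struct foo { int foo; int bar; };
struct bar { int bar[20]; };

/* Hypothetical macro names: a single place to change the defaults later. */
#define FOO_DEFAULTS ((struct foo){ .foo = 0 })
#define BAR_DEFAULTS ((struct bar){ .bar = { 0 } })

/* Defined here only to keep the sketch self-contained. */
struct foo foo1;
struct bar bar1;

void init_foo1(void)
{
    /* Members not named in the compound literal are zero-initialised. */
    foo1 = FOO_DEFAULTS;
}

void init_bar1(void)
{
    /* Remaining array elements are zero-initialised as well. */
    bar1 = BAR_DEFAULTS;
}
```

Whether the compiler turns these assignments into direct stores or a memset call depends on the compiler version and flags; the comment above reports that outcome for arm-none-eabi-gcc 4.9.3 at -Os.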

score -1 · Answer 1 · answered May 10 '17 at 06:45

I tested on amd64 and, to my surprise, the behavior is consistent there too (though I don't know whether it's a bug). gcc places foo1_init and bar1_init in the common data segment, i.e. the segment that the operating system zero-initializes (.bss). foo2_init and bar2_init are put in the read-only segment (.rodata), as if they had non-zero initializers. You can see this by compiling with -O0. Because you're not using an OS, that zero-initialized section has to be initialized by hand, by gcc and/or the linker, and the values are then copied from it. For the .rodata values, gcc optimizes the copy into a direct memset and eliminates the dead *2_init variables. clang, however, optimizes both cases equally.

Here follows gcc output (-O0):

    .file   "defs.c"
    .local  foo1_init
    .comm   foo1_init,8,8
    .section    .rodata
    .align 8
    .type   foo2_init, @object
    .size   foo2_init, 8
foo2_init:
    .zero   8
    .local  bar1_init
    .comm   bar1_init,80,32
    .align 32
    .type   bar2_init, @object
    .size   bar2_init, 80
bar2_init:
    .zero   80
    .text
    .globl  init_foo1
    .type   init_foo1, @function
init_foo1:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    foo1_init(%rip), %rax
    movq    %rax, foo1(%rip)
    nop
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   init_foo1, .-init_foo1
    .globl  init_foo2
    .type   init_foo2, @function
init_foo2:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    $0, foo2(%rip)
    nop
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE1:
    .size   init_foo2, .-init_foo2
    .globl  init_bar1
    .type   init_bar1, @function
init_bar1:
.LFB2:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movq    bar1_init(%rip), %rax
    movq    %rax, bar1(%rip)
    movq    bar1_init+8(%rip), %rax
    movq    %rax, bar1+8(%rip)
    movq    bar1_init+16(%rip), %rax
    movq    %rax, bar1+16(%rip)
    movq    bar1_init+24(%rip), %rax
    movq    %rax, bar1+24(%rip)
    movq    bar1_init+32(%rip), %rax
    movq    %rax, bar1+32(%rip)
    movq    bar1_init+40(%rip), %rax
    movq    %rax, bar1+40(%rip)
    movq    bar1_init+48(%rip), %rax
    movq    %rax, bar1+48(%rip)
    movq    bar1_init+56(%rip), %rax
    movq    %rax, bar1+56(%rip)
    movq    bar1_init+64(%rip), %rax
    movq    %rax, bar1+64(%rip)
    movq    bar1_init+72(%rip), %rax
    movq    %rax, bar1+72(%rip)
    nop
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE2:
    .size   init_bar1, .-init_bar1
    .globl  init_bar2
    .type   init_bar2, @function
init_bar2:
.LFB3:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    movl    $bar2, %eax
    movl    $80, %ecx
    movl    $0, %esi
    movq    %rsi, (%rax)
    movl    %ecx, %edx
    addq    %rax, %rdx
    addq    $8, %rdx
    movq    %rsi, -16(%rdx)
    leaq    8(%rax), %rdx
    andq    $-8, %rdx
    subq    %rdx, %rax
    addl    %eax, %ecx
    andl    $-8, %ecx
    movl    %ecx, %eax
    shrl    $3, %eax
    movl    %eax, %ecx
    movq    %rdx, %rdi
    movq    %rsi, %rax
    rep stosq
    nop
    popq    %rbp
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE3:
    .size   init_bar2, .-init_bar2
    .ident  "GCC: (GNU) 6.3.1 20170306"
    .section    .note.GNU-stack,"",@progbits
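
The comments on the question suggest that tentative definitions are involved. A minimal sketch of that idea, relying on M.M's observation rather than on anything verified here: a static const object declared without an initialiser is only tentatively defined, and the same translation unit may complete the definition later. In C (this is not valid C++) the following compiles, and the copy yields zeros whether or not the completing definition is present; M.M reports that gcc only optimises the copy when it is. foo1 is defined locally here so that the snippet is self-contained.

```c
struct foo { int foo; int bar; };

/* Tentative definition: internal linkage, no initialiser yet. */
static const struct foo foo1_init;

/* Defined here only to keep the sketch self-contained. */
struct foo foo1;

void init_foo1(void)
{
    /* At this point the compiler has seen only the tentative definition. */
    foo1 = foo1_init;
}

/* Completes the tentative definition later in the same translation unit. */
static const struct foo foo1_init = { 0 };
```

Without the final line, the tentative definition still behaves as if it had an initialiser of zero, so the observable result is the same; only the generated code differs.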
