
For this C code:

foobar.c:

static int array[256];

int main() {
    return 0;
}

the array is initialized to all 0's by the C standard. However, when I compile

gcc -S foobar.c

this produces the assembly file foobar.s, which I can inspect, and nowhere in foobar.s is there any initialization of the contents of the array.
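
For concreteness, on my machine (x86-64 Linux; the exact directives will differ with the target and GCC version), the relevant part of foobar.s is roughly:

main:
	pushq	%rbp
	movq	%rsp, %rbp
	movl	$0, %eax
	popq	%rbp
	ret
	.local	array
	.comm	array,1024,32

The .comm directive reserves 1024 bytes for array, but there is no stream of zero bytes anywhere in the file.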

Hence I reason that the contents are not initialized up front; only when an element of the array is accessed is it initialized, somewhat like the "copy-on-write" mechanism used by fork.

Is my reasoning correct? If so, is this a documented feature, and if so where can I find that documentation?

Mark Galeck
  • It will use `.space` or equivalent in the `.bss` section. Yes, modern OSes will lazily map BSS pages; read-only access will result in them being CoW mapped to a physical page of zeros, but write access will result in a fresh page of zeros actually being allocated. Not actually copy-on-write, since it's not copying anything. – Peter Cordes Feb 04 '23 at 03:03
  • @PeterCordes is this what you know from experience, or is there documentation for that I can read? – Mark Galeck Feb 04 '23 at 03:04
  • It's what I know from experience, e.g. from looking at `/proc//smaps` on Linux, and performance experiments where you can get TLB misses but L2 cache hits reading through large arrays in the BSS or from calloc. GCC documentation might say something about how or when it will use the BSS, since apparently there's a `-fno-zero-initialized-in-bss` GCC option ([gcc: put all static/writable variables in the .data section](https://stackoverflow.com/q/48837094)). How an OS implements a zero-initialized BSS is an implementation detail; Linux might document it somewhere. – Peter Cordes Feb 04 '23 at 03:08
  • It's not *necessary* for a BSS to be lazy at all; it could just allocate zeroed physical pages to back the virtual pages of the BSS. That's a separate consideration from not storing the zeros in the executable itself (in the `.data` section like you'd get with `-fno-zero-initialized-in-bss`.) – Peter Cordes Feb 04 '23 at 03:09
  • Re: experimental evidence of lazy BSS: [Why is the second loop over a static array in the BSS faster than the first?](https://stackoverflow.com/q/24376732) - (big enough array that it's not still hot in cache; only page faults explain it.) Also [Why is iterating though \`std::vector\` faster than iterating though \`std::array\`?](https://stackoverflow.com/q/57125253) / [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Feb 04 '23 at 03:15
  • GCC docs: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-section-variable-attribute - *Normally, the compiler places the objects it generates in sections like data and bss.* (and you can override that with `__attribute__(section ("section-name")))`.) And https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fzero-initialized-in-bss - *If the target supports a BSS section, GCC by default puts variables that are initialized to zero into BSS. This can save space in the resulting code.* Like I said, it's not up to GCC *how* the BSS is implemented (e.g. lazily) – Peter Cordes Feb 04 '23 at 03:16
  • [Can static memory be lazily allocated?](https://stackoverflow.com/q/50901873) links some Linux stuff, but not official docs. You might have some luck looking in https://docs.kernel.org/ if you're interested in Linux specifically, but a search on "bss" in those docs only finds some mention of the term. It's sort of too well known and widely assumed behaviour to be documented, or there isn't a tuning setting that controls it. – Peter Cordes Feb 04 '23 at 03:22
  • You want to learn about the ELF file structure and segments, probably. Use a tool like objdump to display the structure of your program file - the information that tells the OS how to load it into memory. It probably won't have 256 0's in it, just a memory segment saying to load a bunch of 0's. – user253751 Feb 04 '23 at 03:53

1 Answer


There are quite a few levels here. This answer addresses Linux in particular, but the same concepts are likely to apply on other systems, possibly with different names.

The compiler needs the object to be "zero initialized": in other words, when a memory read instruction is executed with an address in that range, the value it reads must be zero. As you say, this is necessary to achieve the behavior dictated by the C standard.
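
A minimal C-level illustration of that guarantee (my own sketch, assuming a hosted implementation; the wording is in C11 6.7.9p10):

#include <assert.h>

static int array[256];   /* static storage duration, so zero-initialized */

int main(void) {
    for (int i = 0; i < 256; i++)
        assert(array[i] == 0);   /* must hold before anything writes to array */
    return 0;
}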

The compiler accomplishes this by asking the assembler to fill the space with zeros, one way or another. It may use the .space or .zero directive, which implicitly requests this. It will also place the object in a section with the special name .bss (the reasons for this name are historical). If you look further up in the assembly output, you should see a directive like .bss or .section .bss. The assembler and linker promise that this entire section will be (somehow) initialized to zero. This is documented in the GNU assembler manual:
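
One way to see this on disk (an illustrative check of my own; section numbers, offsets and sizes will differ on your system) is to compile to an object file and dump the section headers. The .bss section has type NOBITS, meaning it occupies no bytes in the file even though it is 0x400 (1024) bytes in memory:

$ gcc -c foobar.c
$ readelf -S foobar.o
  ...
  [ 4] .bss              NOBITS           0000000000000000  00000044
       0000000000000400  0000000000000000  WA       0     0     32
  ...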

The bss section is used for local common variable storage. You may allocate address space in the bss section, but you may not dictate data to load into it before your program executes. When your program starts running, all the contents of the bss section are zeroed bytes.

Okay, so now what do the assembler and linker do to make it happen? Well, an ELF executable file has a program header table, which specifies how and where code and data from the file should be mapped into the program's memory as "segments". (Please note that the use of the word "segment" here has nothing to do with the x86 memory segmentation model or segment registers, and is only vaguely related to the term "segmentation fault".) The size of the segment in memory (p_memsz) and the amount of data to be loaded from the file (p_filesz) are specified separately. If the memory size is greater, then all remaining bytes are to be initialized to zero. This is documented in the elf(5) man page:

PT_LOAD

The array element specifies a loadable segment, described by p_filesz and p_memsz. The bytes from the file are mapped to the beginning of the memory segment. If the segment's memory size p_memsz is larger than the file size p_filesz, the "extra" bytes are defined to hold the value 0 and to follow the segment's initialized area.

So the linker ensures that the ELF executable contains such a segment, and that all objects in the .bss section are placed in this segment, but outside the part that is populated from the file.
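
For example (illustrative, trimmed output from one particular build; run readelf on your own binary for the real numbers), the read-write LOAD segment shows a memory size larger than its file size:

$ readelf -lW a.out
  ...
  Type  Offset   VirtAddr           FileSiz  MemSiz   Flg Align
  LOAD  0x002dd0 0x0000000000003dd0 0x000260 0x000698 RW  0x1000
  ...

The difference between MemSiz and FileSiz is at least the 1024 bytes of the array (plus any other zero-initialized data); those are the bytes the loader must supply as zeros.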

Once all this is done, then the observable behavior is guaranteed: as above, when an instruction attempts to read from this object before it has been written, the value it reads will be zero.

Now as to how that behavior is ensured at runtime: that is the job of the kernel. It could do it by pre-allocating actual physical memory for that range of virtual addresses, and filling it with zeros. Or it could use an "allocate on demand" method, like what you describe, leaving those pages unmapped in the CPU's page tables. Then any access to those pages by the application will cause a page fault, which the kernel handles by allocating zero-filled physical memory at that time and then restarting the faulting instruction. This is completely transparent to the application. It just sees that the read instruction got the value zero; if there was a page fault, it merely seems as though the read instruction took a long time to execute.
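
On Linux you can watch the on-demand behavior happen, though this is an observation, not anything documented or guaranteed: the process's resident set size only grows once the pages are actually written. A rough sketch that reads VmRSS from /proc/self/status:

#include <stdio.h>
#include <string.h>

static char big[64 * 1024 * 1024];   /* 64 MiB of zero-initialized BSS */

/* Print the VmRSS line from /proc/self/status (Linux-specific). */
static void print_rss(const char *when)
{
    FILE *f = fopen("/proc/self/status", "r");
    char line[256];
    if (!f)
        return;
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "VmRSS:", 6) == 0)
            printf("%s: %s", when, line);
    fclose(f);
}

int main(void)
{
    print_rss("before touching big");
    memset(big, 1, sizeof big);      /* each write fault allocates a real page */
    print_rss("after touching big");
    return 0;
}

Typically the second VmRSS figure is about 64 MiB larger than the first, even though big has "existed" since the program started.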

The kernel normally uses the "on demand" method, because it is more efficient in case not all of the "zero initialized" memory is actually used. But this is not going to be documented as guaranteed behavior; it is an implementation detail. An application programmer need not care, and in fact must not care, how it works under the hood. If the Linux kernel maintainers decide tomorrow to switch everything to the pre-allocate method, every application will work exactly as it did before, just maybe a little faster or slower.

Nate Eldredge
  • The OP asked where some of this is *documented*. From comments under the question, [the GCC manual entry](https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fzero-initialized-in-bss) for `-fzero-initialized-in-bss` is where GCC documents that it uses the BSS on targets that support one, because that option is on by default. – Peter Cordes Feb 06 '23 at 02:38
  • If the first access to an "on demand" page is a load, Linux actually does copy-on-write map it to a system-wide shared physical page of zeros. (So the first write will also fault, resulting in zeroing a page. Perhaps just memset of zeros, not actually memcpy from the zero page, if that's special-cased. It probably is since I think Linux does this even for transparent hugepages.) [Why is iterating though \`std::vector\` faster than iterating though \`std::array\`?](https://stackoverflow.com/q/57125253) – Peter Cordes Feb 06 '23 at 02:43