Why use .data instead of reserving space in .bss and initializing at runtime, for variables in assembly/C?

Question

First of all: I know that there are a lot of web pages (including discussions on Stack Overflow) where the differences between .bss and .data for data declarations are discussed, but I have a specific question and unfortunately I did not find the answer on these pages, so I ask it here :-).

I am a big beginner in assembly, so I apologize if the question is stupid :-).

I am learning assembly on an x86 64-bit Linux OS (but I think that my question is more general and probably not specific to the OS or the architecture).

I find the definition of the .bss and .data sections a bit strange. I can always declare a variable in .bss and then move a value into this variable in my code (.text section), right? So why should I declare a variable in the .data section if I know that variables declared in this section will increase the size of my executable file?

I could ask this question in the context of C programming as well: why should I initialize my variable when I declare it, if it is more efficient to declare it uninitialized and then assign a value to it at the beginning of my code?

I suppose that my approach of memory management is naive and not correct, but I do not understand why.

Trivia - bss stands for block started by symbol. Some assemblers also have bes for block ended by symbol (this would make sense for stack type usage of memory). — rcgldr, Feb 19 '19 at 13:07

Peter Cordes · Accepted Answer · 2022-03-13T23:12:39.870

.bss is where you put zero-initialized static data, like C int x; (at global scope). That's the same as int x = 0; for static / global (static storage class)¹.

.data is where you put non-zero-initialized static data, like int x = 2; If you put that in BSS, you'd need a runtime static "constructor" to initialize the BSS location. Like what a C++ compiler would do for static const int prog_starttime = __rdtsc();. (Even though it's const, the initializer isn't a compile-time constant so it can't go in .rodata)

.bss with a runtime initializer would make sense for big arrays that are mostly zero or filled with the same value (memset / rep stosd), but in practice writing char buf[1024000] = {1}; will put 1MB of almost all zeros into .data, with current compilers.

Otherwise it is not more efficient. A mov dword [myvar], imm32 instruction is at least 8 bytes long, costing about twice as many bytes in your executable as if it were statically initialized in .data. Also, the initializer has to be executed.
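For the big, mostly-zero array case mentioned above, here is a minimal sketch of the ".bss plus runtime initializer" alternative. The names buf, init_buf, count_nonzero and demo_big_buffer are made up for illustration, and where the array actually lands depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>

/* No initializer, so this 1 MB array is expected to land in .bss
   and take no space in the executable file. */
static char buf[1024000];

/* Hypothetical runtime "constructor": one store replaces roughly
   1 MB of initializer bytes that `= {1}` would put in .data. */
void init_buf(void) { buf[0] = 1; }

/* Counts non-zero bytes so the state can be observed by a caller. */
size_t count_nonzero(void) {
    size_t n = 0;
    for (size_t i = 0; i < sizeof buf; i++)
        if (buf[i] != 0) n++;
    return n;
}

/* Self-check: all zero at startup, exactly one non-zero byte after init. */
void demo_big_buffer(void) {
    assert(count_nonzero() == 0);
    init_buf();
    assert(count_nonzero() == 1);
}
```

The trade-off is that init_buf has to be called before anything reads buf, which is exactly the kind of constructor ordering that a static initializer in .data avoids.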

By contrast, section .rodata (or .rdata on Windows) is where compilers put string literals, FP constants, and static const int x = 123; (Actually, x would normally get inlined as an immediate everywhere it's used in the compilation unit, letting the compiler optimize away any static storage. But if you took its address and passed &x to a function, the compiler would need it to exist in memory somewhere, and that would be in .rodata)
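As a rough illustration of the three cases, the sketch below declares one object of each kind. The section named in each comment is what mainstream compilers typically choose, not something the C language guarantees, and all identifiers are invented for this example.

```c
#include <assert.h>

int zeroed_counter;            /* no initializer: zero, typically .bss   */
int also_zero = 0;             /* explicit 0: usually treated the same   */
int start_value = 2;           /* non-zero initializer: typically .data  */
static const int limit = 123;  /* constant data: typically .rodata       */

/* Taking the address forces `limit` to exist in memory somewhere. */
const int *limit_address(void) { return &limit; }

/* Self-check of the values the C rules guarantee before main runs. */
void demo_sections(void) {
    assert(zeroed_counter == 0);
    assert(also_zero == 0);
    assert(start_value == 2);
    assert(*limit_address() == 123);
}
```

Compiling this to an object file and inspecting it with a tool such as objdump -h or size is one way to check where a particular toolchain actually put each object.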

Footnote 1: Inside a function, int x; would be on the stack if the compiler didn't optimize it away or into registers, when compiling for a normal register machine with a stack like x86.

I could ask this question in the context of C programming as well

In C, an optimizing compiler will treat int x; x=5; pretty much identically to int x=5; inside a function. No static storage is involved. Looking at actual compiler output is often instructive: see How to remove "noise" from GCC/clang assembly output?.

Outside a function, at global scope, you can't write things like x=5;. You could do that at the top of main, and then you would trick the compiler into making worse code.

Inside a function with static int x = 5;, the initialization happens once. (At compile time). If you did static int x; x=5; the static storage would be re-initialized every time the function was entered, and you might as well have not used static unless you have other reasons for needing static storage class. (e.g. returning a pointer to x that's still valid after the function returns.)
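The difference between the two forms is easy to observe. In this small sketch (function names are hypothetical), the first function keeps state across calls and the second throws it away on every entry.

```c
#include <assert.h>

/* Initialized once, before the program runs; the 5 lives in .data. */
int next_id_static_init(void) {
    static int x = 5;
    return x++;
}

/* Re-assigned on every call, so the static storage never carries state. */
int next_id_reassigned(void) {
    static int x;
    x = 5;
    return x++;
}

/* Self-check showing the two behaviours side by side. */
void demo_static_init(void) {
    assert(next_id_static_init() == 5);
    assert(next_id_static_init() == 6);  /* state was kept      */
    assert(next_id_reassigned() == 5);
    assert(next_id_reassigned() == 5);   /* state was discarded */
}
```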

Ok ! Thank you for your detailed answer. It is clearer for me now. I will try to think about what you said and post a new question if needed :-). — Louis, Feb 19 '19 at 10:06

score 1 · Answer 2 · edited Feb 19 '19 at 16:19

The size of an instruction that writes an immediate operand (i.e., a compile-time constant) into a memory location is necessarily larger than the size of the constant itself. If all of the constants are different values, then you need to use different instructions for different values and the total size of these instructions would be larger than the total size of the values. In addition, there will be a run-time performance overhead to execute these instructions. If the constants are the same, then a loop can be used to initialize all the corresponding variables. The loop itself would be indeed much smaller than the total size of the constants. In this case, instead of allocating many static variables to hold the same constant, you can use something like malloc followed by a loop to initialize the allocated region. This can significantly reduce the size of an object file and improve performance.
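A minimal sketch of the "malloc followed by a loop" idea for many variables that share one value; COUNT and make_filled are names chosen only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

enum { COUNT = 1024 };  /* hypothetical number of equal-valued elements */

/* Allocates COUNT ints and sets each one to `value` with a loop.
   Returns NULL on allocation failure; the caller frees the result. */
int *make_filled(int value) {
    int *p = malloc(COUNT * sizeof *p);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < COUNT; i++)
        p[i] = value;
    return p;
}

/* Self-check: first and last elements hold the requested value. */
void demo_make_filled(void) {
    int *p = make_filled(7);
    assert(p != NULL);
    assert(p[0] == 7 && p[COUNT - 1] == 7);
    free(p);
}
```

The loop costs a handful of instructions regardless of COUNT, whereas a static array with a non-zero initializer would store all COUNT values in the file.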

Consider an OS that keeps a number of pages initialized to some constant, or where different pages might be initialized to different constants. These pages can be prepared by the OS in a background thread. When a program requests a page that is initialized to a particular constant, the OS can simply map one of the pages that it has already initialized into the program's page table, thereby avoiding the need to execute a loop at run-time. In fact, the Windows OS always initializes all reclaimed pages to a constant value of all-bits-zero. This is both a security feature and a performance enhancement feature.

Static variables are typically either not initialized at compile-time or initialized to zero. Some languages, such as C and C++, require the runtime to initialize uninitialized static variables to zero. What is the most efficient way to initialize pages to zero? The C runtime could, for example, emit a sequence of instructions or a loop in the entry point of an object file to initialize all uninitialized static variables to the specified compile-time constants. But then every object file would require these instructions. It is more space-efficient to delegate this initialization to the OS, which does it on demand (on Linux) or proactively (on Windows).

The ELF executable format defines the bss section as the portion of the object file that contains uninitialized variables. Therefore, the bss section only needs to specify the total size of all variables, in contrast to the data section, which also needs to specify the value of each variable. There is no requirement that the OS should initialize (or not) the bss section to zero or any other value, but typically this is indeed the case. In addition, although C/C++ requires the runtime to initialize all static variables that are not explicitly initialized to zero/null, the language standard does not define a particular bit pattern for zero/null. Only when the language implementation and the bss implementation match can uninitialized static variables be allocated in the bss section.

When Linux loads an ELF binary, it maps the bss section to a dedicated zero page marked as copy-on-write (see: How exactly does copy on write work). So there is no overhead to initialize that page to zero. In some cases, bss may occupy a fraction of a page (See for example Gnu assembler .data section value corrupted after syscall). In this case, that fraction is explicitly initialized to all-bits-zero using a movb/incq/decl/jnz loop.

A hypothetical OS can for example initialize each byte of the bss section to 0000_0001b. Also in a hypothetical implementation of C, the NULL-pointer bit pattern may be (multiple bytes of) 0000_0010b. In this case, default-initialized static pointer variables and arrays can be allocated in the bss section without any init loop inside the C program. But any other values, such as integer arrays, will need an init loop unless they happen to be explicitly initialized in the C source to a value that matches the bit-pattern.

(C allows an implementation-defined non-zero object representation for NULL pointers, but integers are more constrained. C rules require static storage-class variables to be implicitly initialized to 0 if not explicitly initialized. And unsigned char is required to be base 2 with no padding. 0 as an initializer for a pointer in the source maps to the NULL bit pattern, unlike using memcpy of unsigned char zeros into the object representation.)
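The distinction between "initialized to null" and "all bytes zero" can be probed on a given implementation with a sketch like the following. The function names are invented here; the first check is guaranteed by C, while the second only reports what the current platform happens to do.

```c
#include <assert.h>
#include <string.h>

/* Static storage with no initializer: C guarantees a null pointer. */
static int *implicit_null;

/* Always 1 on a conforming implementation. */
int implicit_is_null(void) { return implicit_null == NULL; }

/* Returns 1 if a null pointer's object representation is all-bits-zero
   on this implementation. Common in practice, not required by C. */
int null_is_all_zero_bits(void) {
    int *p = NULL;
    unsigned char zeros[sizeof p];
    memset(zeros, 0, sizeof zeros);
    return memcmp(&p, zeros, sizeof p) == 0;
}

/* Self-check of the part that the language guarantees. */
void demo_null_pattern(void) {
    assert(implicit_is_null());
}
```

Only when null_is_all_zero_bits() would return 1 can a zero-filled .bss serve as the storage for implicitly initialized pointers without any extra init code.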

Visual Studio / Microsoft 32/64 bit tool sets zero out .bss variables. Microsoft 16 bit tool sets leave .bss space uninitialized, and variables in .data declared with ?, such as "mydata db 20 dup (?)", will just pick up whatever is in memory at assembly time, sometimes you'll get bits of source code. — rcgldr, Feb 19 '19 at 13:01
@rcgldr Variables go into .data when they are explicitly initialized, so how can they be left uninitialized by the 16-bit toolset? On all versions of Windows at least since XP, there is a zero page thread that zeroes out pages, and it's impossible to allocate a page from user land that is not zero-initialized whether it is for .bss, .data, or any other purpose. — Hadi Brais, Feb 19 '19 at 13:50
Part of my comment was a reference to Microsoft 16 bit tool set, typically used with MSDOS 6.22. .bss is not initialized. Also as I previously commented, a data value declared as "?" is supposed to be uninitialized, normally used for .bss, but in .data, for "?", MASM (5.x, 6.x aka ML.EXE) just grabs whatever happens to be in memory at assembly time. — rcgldr, Feb 19 '19 at 14:31
*in a hypothetical implementation of C, the zero/null bit pattern may be 0000_0010b*. That's not practical. The `NULL` *pointer* representation is implementation-defined, but `unsigned char` is required to be normal base 2 with no padding. (i.e. `2^CHAR_BIT` possible values). If the underlying bit-pattern for `0` isn't actually `0`, almost everything has to be emulated, including memcpy of object representations. — Peter Cordes, Feb 19 '19 at 15:01
I think Linux only uses `__clear_user` with that simplistic movq loop for small BSSes that end up in the same page as the .data segment. We know that *large* BSS arrays have all their pages copy-on-write mapped to the same physical zero page, so they definitely weren't all just written with zeros to each virtual page separately. I'm surprised there isn't a `rep stosb` alternative for `__clear_user`. Vs. a `movq` loop, the break even point might be something like 128 or 256 bytes, and it avoids branch misses for cleanup. — Peter Cordes, Feb 19 '19 at 15:06
@PeterCordes Oh yes, forgot about that. The check at line 917 seems to be checking whether the base address of bss is page-aligned, and only if it is not does it call that function to zero out that fraction of the page. — Hadi Brais, Feb 19 '19 at 15:16
@PeterCordes I think that when a program attempts to write to that zero page, the OS must allocate a physical page and initialize to zero, then allow the write to be re-attempted. We can probably find this code in the page fault handling logic. This is in contrast to Windows where the OS just grabs an already zeroed page when writing to the zero page (I assume there is a zero page on Windows too). — Hadi Brais, Feb 19 '19 at 15:26
Yes, I think that's how Linux COW works. I think it used to keep a pool of zeroed pages around, with a kernel background thread, but on the fly zeroing makes some sense by priming caches for that page. (Unless it uses NT stores?) Or at least opening the containing DRAM page (not the same thing as virtual-memory page) possibly making future writes slightly lower latency. — Peter Cordes, Feb 19 '19 at 15:30
An ELF image with small `.data` and `.bss` can and does put them both in the same page, contiguous with each other. I think this is a linker decision that ELF program headers allow it to implement. [Gnu assembler .data section value corrupted after syscall](//stackoverflow.com/a/50584542) shows `.data` and `.bss` in the same ELF segment, and having the program header say to only load 1 byte into that page instead of private mapping (leaving the rest zeroed as the BSS). (Took a while to find search terms that found that answer I remembered writing :P) — Peter Cordes, Feb 19 '19 at 15:48

score 1 · Answer 3 · edited Mar 13 '22 at 22:31

I'll do this by way of example, using ARM despite the x86 tag: it is easier to read, and functionally the same.

bootstrap

.globl _start
_start:
    ldr r0,=__bss_start__
    ldr r1,=__bss_end__
    mov r2,#0
bss_fill:
    cmp r0,r1
    beq bss_fill_done
    strb r2,[r0],#1
    b bss_fill
bss_fill_done:
    /* data copy would go here */
    bl main
    b .

This code might be buggy, definitely inefficient, but here for demonstration purposes.

C code

unsigned int ba;
unsigned int bb;
unsigned int da=5;
unsigned int db=0x12345678;
int main ( void )
{
    ba=5;
    bb=0x88776655;
    return(0);
}

I could use assembly as well, but .bss, .data, etc don't make as much sense in asm as they do in compiled code.

MEMORY
{
    rom : ORIGIN = 0x08000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > rom
    .rodata : { *(.rodata*) } > ram
    __bss_start__ = .;
    .bss : { *(.bss*) } > ram
    __bss_end__ = .;
    __data_start__ = .;
    .data : { *(.data*) } > ram
    __data_end__ = .;
}

Linker script used.

Result:

Disassembly of section .text:

08000000 <_start>:
 8000000:   e59f001c    ldr r0, [pc, #28]   ; 8000024 <bss_fill_done+0x8>
 8000004:   e59f101c    ldr r1, [pc, #28]   ; 8000028 <bss_fill_done+0xc>
 8000008:   e3a02000    mov r2, #0

0800000c <bss_fill>:
 800000c:   e1500001    cmp r0, r1
 8000010:   0a000001    beq 800001c <bss_fill_done>
 8000014:   e4c02001    strb    r2, [r0], #1
 8000018:   eafffffb    b   800000c <bss_fill>

0800001c <bss_fill_done>:
 800001c:   eb000002    bl  800002c <main>
 8000020:   eafffffe    b   8000020 <bss_fill_done+0x4>
 8000024:   08000058    stmdaeq r0, {r3, r4, r6}
 8000028:   20000008    andcs   r0, r0, r8

0800002c <main>:
 800002c:   e3a00005    mov r0, #5
 8000030:   e59f1014    ldr r1, [pc, #20]   ; 800004c <main+0x20>
 8000034:   e59f3014    ldr r3, [pc, #20]   ; 8000050 <main+0x24>
 8000038:   e59f2014    ldr r2, [pc, #20]   ; 8000054 <main+0x28>
 800003c:   e5810000    str r0, [r1]
 8000040:   e5832000    str r2, [r3]
 8000044:   e3a00000    mov r0, #0
 8000048:   e12fff1e    bx  lr
 800004c:   20000004    andcs   r0, r0, r4
 8000050:   20000000    andcs   r0, r0, r0
 8000054:   88776655    ldmdahi r7!, {r0, r2, r4, r6, r9, r10, sp, lr}^

Disassembly of section .bss:

20000000 <bb>:
20000000:   00000000    andeq   r0, r0, r0

20000004 <ba>:
20000004:   00000000    andeq   r0, r0, r0

Disassembly of section .data:

20000008 <db>:
20000008:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000

2000000c <da>:
2000000c:   00000005    andeq   r0, r0, r5

Clearly at the end you see the storage for the four variables and they are .bss and .data as expected.

But here is the difference that folks are trying to explain.

There should be code to zero .bss, and yes, that is a waste of cycles. Some compilers are starting to warn about using uninitialized variables, which is good, but either way .bss needs some code to zero it. .data might also need some code to copy it; I didn't complete this example to show how that works. You tell the linker script that .data lives in ram but that a copy is placed in rom, you export both addresses plus the size or end (a rom data start and a ram data start), and the bootstrap copies from rom to ram.
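The zero-and-copy work of a bootstrap can be sketched in C. A real bootstrap would take its addresses from linker-script symbols; this sketch instead uses plain arrays as stand-ins for rom and ram so it can run on a hosted system, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the .data image stored in rom (0x12345678 and 5, little endian). */
static const unsigned char rom_data_image[8] =
    {0x78, 0x56, 0x34, 0x12, 0x05, 0x00, 0x00, 0x00};

/* Stand-ins for ram; filled with 0xAA to mimic undefined power-on contents. */
static unsigned char ram_data[8] = {0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA};
static unsigned char ram_bss[8]  = {0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA, 0xAA};

/* Byte-wise fill, same shape as the bss_fill loop in the assembly above. */
void zero_bss(unsigned char *start, unsigned char *end) {
    while (start != end)
        *start++ = 0;
}

/* Byte-wise copy of the initializers from their rom image into ram. */
void copy_data(unsigned char *dst, const unsigned char *src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* What a bootstrap would do before calling main. */
void fake_startup(void) {
    zero_bss(ram_bss, ram_bss + sizeof ram_bss);
    copy_data(ram_data, rom_data_image, sizeof ram_data);
}

/* Returns 1 if ram now holds what compiled C code expects to find. */
int startup_ok(void) {
    static const unsigned char zeros[sizeof ram_bss];
    return memcmp(ram_bss, zeros, sizeof ram_bss) == 0 &&
           memcmp(ram_data, rom_data_image, sizeof ram_data) == 0;
}
```

Before fake_startup() runs, startup_ok() reports 0 because both regions still hold the 0xAA filler; afterwards it reports 1.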

So for .data the cost is this: the memory is allocated, and depending on the operating system loader or your own bootstrap, that data might or might not need to be copied an additional time.

20000008 <db>:
20000008:   12345678

for .bss

20000000 <bb>:
20000000:   00000000    andeq   r0, r0, r0

Again, it depends on the OS loader and/or how you build. In this case, with .data placed after .bss and at least one .data item present, an objcopy -O binary of this would put the zeroed .bss bytes in the .bin, so nothing would need to fill .bss at runtime; whether that works depends on the loader and the destination.

So the storage is equal, but the extra cost for .bss is

 800002c:   e3a00005    mov r0, #5
 8000030:   e59f1014    ldr r1, [pc, #20]   ; 800004c <main+0x20>

 800003c:   e5810000    str r0, [r1]

 800004c:   20000004

and

 8000034:   e59f3014    ldr r3, [pc, #20]   ; 8000050 <main+0x24>
 8000038:   e59f2014    ldr r2, [pc, #20]   ; 8000054 <main+0x28>

 8000040:   e5832000    str r2, [r3]

 8000050:   20000000
 8000054:   88776655

The first one requires an instruction to put the 5 in a register, an instruction to get the address, and a memory cycle to store 5 in memory. The second is more costly, as it takes an instruction with a memory cycle to get the data, then one to get the address, then the store, all of them being memory cycles.

Another answer here has tried to argue that you don't have a static cost because they are immediates, but with variable-length instruction sets those immediates are still there and are read from memory just as with fixed-length ones. It's not a separate memory cycle, it is part of the prefetching, but it is still static storage. The difference is that you have at least one memory cycle to store the value in memory (.bss and .data imply global, so the store to memory is required). Because these are linked, the addresses of the variables need to be put in place by the linker. With a fixed-length RISC instruction set that is a pool nearby; for CISC like x86 it would be embedded in a mov immediate to register. Either way there is static storage for the address and static storage for the value. Comparing x86 to ARM, x86 would use fewer bytes of instructions and perform the task in two instructions, where ARM takes three instructions and three separate memory cycles. Functionally the same.

Now where this can save you, by violating expectations but being in complete control (bare metal).

.globl _start
_start:
    ldr sp,=0x20002000
    bl main
    b .



unsigned int ba;
unsigned int bb;
int main ( void )
{
    ba=5;
    bb=0x88776655;
    return(0);
}

MEMORY
{
    rom : ORIGIN = 0x08000000, LENGTH = 0x1000
    ram : ORIGIN = 0x20000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > rom
    .rodata : { *(.rodata*) } > ram
    .bss : { *(.bss*) } > ram
}


Disassembly of section .text:

08000000 <_start>:
 8000000:   e59fd004    ldr sp, [pc, #4]    ; 800000c <_start+0xc>
 8000004:   eb000001    bl  8000010 <main>
 8000008:   eafffffe    b   8000008 <_start+0x8>
 800000c:   20002000    andcs   r2, r0, r0

08000010 <main>:
 8000010:   e3a00005    mov r0, #5
 8000014:   e59f1014    ldr r1, [pc, #20]   ; 8000030 <main+0x20>
 8000018:   e59f3014    ldr r3, [pc, #20]   ; 8000034 <main+0x24>
 800001c:   e59f2014    ldr r2, [pc, #20]   ; 8000038 <main+0x28>
 8000020:   e5810000    str r0, [r1]
 8000024:   e5832000    str r2, [r3]
 8000028:   e3a00000    mov r0, #0
 800002c:   e12fff1e    bx  lr
 8000030:   20000004    andcs   r0, r0, r4
 8000034:   20000000    andcs   r0, r0, r0
 8000038:   88776655    ldmdahi r7!, {r0, r2, r4, r6, r9, r10, sp, lr}^

Disassembly of section .bss:

20000000 <bb>:
20000000:   00000000    andeq   r0, r0, r0

20000004 <ba>:
20000004:   00000000    andeq   r0, r0, r0

(I think I deleted the stack init in the prior example)

There was no need to complicate the (toolchain specific) linker script and no need to initialize any of the memory in the bootstrap; instead, init the variables in the code. It is more costly as far as .text space goes, but easier to write and maintain, easier to port if the need arises, etc. But it breaks known rules/assumptions if someone wants to take that code and add a .data item or assume a .bss item is zeroed.

Another shortcut, say Raspberry Pi bare metal:

.globl _start
_start:
    ldr sp,=0x8000
    bl main
    b .

unsigned int ba;
unsigned int bb;
unsigned int da=5;
int main ( void )
{
    return(0);
}

MEMORY
{
    ram : ORIGIN = 0x00008000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > ram
    .rodata : { *(.rodata*) } > ram
    .bss : { *(.bss*) } > ram
    .data : { *(.data*) } > ram
}



Disassembly of section .text:

00008000 <_start>:
    8000:   e3a0d902    mov sp, #32768  ; 0x8000
    8004:   eb000000    bl  800c <main>
    8008:   eafffffe    b   8008 <_start+0x8>

0000800c <main>:
    800c:   e3a00000    mov r0, #0
    8010:   e12fff1e    bx  lr

Disassembly of section .bss:

00008014 <bb>:
    8014:   00000000    andeq   r0, r0, r0

00008018 <ba>:
    8018:   00000000    andeq   r0, r0, r0

Disassembly of section .data:

0000801c <da>:
    801c:   00000005    andeq   r0, r0, r5

hexdump -C so.bin
00000000  02 d9 a0 e3 00 00 00 eb  fe ff ff ea 00 00 a0 e3  |................|
00000010  1e ff 2f e1 00 00 00 00  00 00 00 00 05 00 00 00  |../.............|
00000020

Because a .data item exists and .data is defined after .bss in the linker script, the .bss zeros end up in the binary, and the GPU copies the binary into ram for us as a whole (.text, .bss, .data, etc.). The zeroing of .bss was a freebie: we didn't need to add any code for .bss, and if we have more .data and are using it, we got a free init/copy of .data as well.

These are corner cases, but they do demonstrate the kind of thing you were thinking about: why zero a variable that I can just change, or will end up changing, in .text later? I extend that to: why burn boot time zeroing that section in the first place, and why complicate the linker script? GNU linker scripts are nasty and painful at best, and you have to be very careful to get them right. Granted, once you get them right, it is not too much work to check with each new rev of the toolchain that they still work.

To do it correctly, .bss costs you instructions and execution time of those instructions, including the separate memory bus cycle(s). But there should be linker script and bootstrap code there no matter what for .bss. Likewise for .data, but unless it is rom/flash based, it is likely that the source and destination for .data are the same: the copy happened in the loader (the operating system copying the binary from rom/flash/disk to memory), and no additional copy is needed unless you force one in the linker script.

Well, based on comments in other questions, let's say "correctly" means based on assumptions. The .data items need to show up as defined in the compiled code. What you find for .bss has historically been toolchain specific; what the spec says I would have to look up, along with which version applies to which toolchain you might end up using, since despite popular belief not all toolchains in use today are constantly maintained to comply with the standard that is in place this second. Some folks have the luxury of limiting their projects to those that have up-to-date tools; many don't.

The shortcuts shown here are similar to hand tuned assembly vs just taking what the compiler provides, you are on your own and it can be risky if you are not careful, but you can get a decent performance gain on boot doing something like that, if that is something desired/required for your project. Would not use anything like that for non-specialized work.

Also note you are well into the don't use global variables religious debate with this discussion as well. If you don't use globals then you still deal with local globals as I call them, or in other words local static variables which fall into this category.

unsigned int more_fun ( unsigned int, unsigned int );
void fun ( unsigned int x )
{
    static int ba;
    static int da=0x12345678;
    ba+=x;
    da=more_fun(ba,da);
}
int main ( void )
{
    return(0);
}

0000800c <fun>:
    800c:   e59f2028    ldr r2, [pc, #40]   ; 803c <fun+0x30>
    8010:   e5923000    ldr r3, [r2]
    8014:   e92d4010    push    {r4, lr}
    8018:   e59f4020    ldr r4, [pc, #32]   ; 8040 <fun+0x34>
    801c:   e0803003    add r3, r0, r3
    8020:   e5941000    ldr r1, [r4]
    8024:   e1a00003    mov r0, r3
    8028:   e5823000    str r3, [r2]
    802c:   ebfffff6    bl  800c <fun>
    8030:   e5840000    str r0, [r4]
    8034:   e8bd4010    pop {r4, lr}
    8038:   e12fff1e    bx  lr
    803c:   0000804c    andeq   r8, r0, r12, asr #32
    8040:   00008050    andeq   r8, r0, r0, asr r0

00008044 <main>:
    8044:   e3a00000    mov r0, #0
    8048:   e12fff1e    bx  lr

Disassembly of section .bss:

0000804c <ba.3666>:
    804c:   00000000    andeq   r0, r0, r0

Disassembly of section .data:

00008050 <da.3667>:
    8050:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000

Being local static or local globals they still land in .data or .bss.
