Say I have a function where depending on some runtime condition an expensive automatic object is created or a cheap automatic object is created:

void foo() {
   if (runtimeCondition) {
       int x = 0;
   } else {
       SuperLargeObject y;
   }
}

When the compiler allocates memory for a stack frame for this function, will it just allocate enough memory to store the SuperLargeObject, and if the condition leading to the int is true that extra memory will just be unused? Or will it allocate memory some other way?

genpfault
TwistedBlizzard
  • It depends on your compiler and probably on the optimization settings. In debug builds most C++ compilers will allocate stack memory for both objects and use one or the other based on which branch is taken. In optimized builds things get more complicated. – chrysante May 19 '23 at 18:02
  • I *think* in practice they allocate memory for the biggest branch. – user253751 May 19 '23 at 18:03
  • The stack is known to be a limited resource, so why put `SuperLargeObject`s on there in the 1st place? – Richard Critten May 19 '23 at 18:09
  • My understanding is that the compiler will only allocate memory for a variable when execution encounters the variable. – Thomas Matthews May 19 '23 at 18:12
  • @RichardCritten I just meant to say if you had two variables declared in two different branches, where one variable is larger than the other. It could be like an int and a long. – TwistedBlizzard May 19 '23 at 18:13
  • @ThomasMatthews it will only construct them when it encounters them, but it could adjust the sp register on entry to make enough room for the maximum space required. – Richard Critten May 19 '23 at 18:13
  • See the single `sub rsp, 40016` here - live - https://godbolt.org/z/c34aYYEKP - if you turn on optimization (add -O2) it all changes, as it's hard to construct a trivial example that does not get optimized away. – Richard Critten May 19 '23 at 18:17
  • @RichardCritten At least GCC's front-end and Clang always place `alloca` instructions in the entry basic block for each local variable, to avoid calling `alloca` in a loop and potentially smashing the stack. – chrysante May 19 '23 at 18:41
  • @RichardCritten My trick for this is to add an `extern void consume(void*);` and call it with the address of anything you don't want to be optimized out. https://godbolt.org/z/9Gr5549vq – Raymond Chen May 19 '23 at 19:22

2 Answers

It depends on your compiler and on the optimization settings. In unoptimized builds most C++ compilers will probably allocate stack memory for both objects and use one or the other based on which branch is taken. In optimized builds things get more interesting:

If both objects (the int and the SuperLargeObject) are unused and the compiler can prove that constructing SuperLargeObject has no side effects, both allocations will be elided.

If the objects escape the function, i.e. their addresses are passed to another function, the compiler has to provide memory for them. But since their lifetimes don't overlap, they can be stored in overlapping memory regions. It is up to the compiler if that actually happens or not.

As the listings that follow show, different compilers generate different assembly for these two functions (f is a modified version of the OP's example, g is a reference function; all compiled for x86-64):

void escape(void const*);

struct SuperLargeObject {
    char data[104];
};

void f(bool cond) {
    if (cond) {
        int x;
        escape(&x);
    }
    else {
        SuperLargeObject y;
        escape(&y);
    }
}

void g() {
    SuperLargeObject y;
    escape(&y);
}

Note that all stack allocations are odd multiples of 8, because the x86-64 ABI requires the stack pointer to be 16-byte aligned at each call, and the call instruction itself pushes 8 bytes for the return address (thanks to @PeterCordes for explaining this to me on another post).
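The arithmetic behind that note can be sketched as a compile-time check. This is only an illustration under stated assumptions: the helper `min_frame` and the two constants are hypothetical names, and the model ignores pushed registers and any extra padding a compiler may choose to add.

```cpp
// Assumptions (System V x86-64): rsp is 16-byte aligned right before a call,
// and the call instruction pushes an 8-byte return address.
constexpr unsigned kStackAlign = 16;
constexpr unsigned kReturnAddr = 8;

// Hypothetical helper: the smallest N for `sub rsp, N` that covers `locals`
// bytes and leaves rsp 16-byte aligned again for the next call.
constexpr unsigned min_frame(unsigned locals) {
    return (locals + kReturnAddr + kStackAlign - 1) / kStackAlign * kStackAlign
           - kReturnAddr;
}

// One 104-byte object: 104 + 8 = 112 = 7 * 16, so no padding is needed.
static_assert(min_frame(104) == 104, "single object fits exactly");
// The object plus a separate 4-byte int: 108 + 8 = 116 rounds up to 128.
static_assert(min_frame(104 + 4) == 120, "two non-overlapping objects");
// Both results are 8 modulo 16, i.e. odd multiples of 8.
static_assert(min_frame(104) % 16 == 8 && min_frame(108) % 16 == 8,
              "odd multiples of 8");
```

Under these assumptions the helper reproduces the 104 and 120 bytes in the ICC listing and the 104 bytes in the Clang listing. GCC's 120 bytes for a single 104-byte object is above this minimum, so alignment alone does not account for it.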

ICC

f(bool):
        sub       rsp, 120
        test      dil, dil
        lea       rax, QWORD PTR [104+rsp]
        lea       rdx, QWORD PTR [rsp]
        cmovne    rdx, rax
        mov       rdi, rdx
        call      escape(void const*)
        add       rsp, 120
        ret
g():
        sub       rsp, 104
        lea       rdi, QWORD PTR [rsp]
        call      escape(void const*)
        add       rsp, 104
        ret

ICC seems to allocate enough memory to store both objects and then selects between the two non-overlapping regions based on the runtime condition (using cmov) and passes the selected pointer to the escaping function.

In the reference function g it only allocates 104 bytes, exactly the size of SuperLargeObject.

GCC

f(bool):
        sub     rsp, 120
        mov     rdi, rsp
        call    escape(void const*)
        add     rsp, 120
        ret
g():
        sub     rsp, 120
        mov     rdi, rsp
        call    escape(void const*)
        add     rsp, 120
        ret

GCC also allocates 120 bytes, but it places both objects at the same address and thus emits no cmov instruction.

Clang

f(bool):
        sub     rsp, 104
        test    edi, edi
        mov     rdi, rsp
        call    escape(void const*)@PLT
        add     rsp, 104
        ret
g():
        sub     rsp, 104
        mov     rdi, rsp
        call    escape(void const*)@PLT
        add     rsp, 104
        ret

Clang merges the two allocations as well, and additionally reduces the allocation size to the necessary 104 bytes.

Unfortunately I don't understand why it tests the condition in function f.


You should also note that when the compiler can place either or both of the variables in registers, no memory will be allocated for them at all, even when they are used and reassigned throughout the function. For ints, longs and other small objects that is most often the case, as long as their addresses do not escape the function.

chrysante
  • Interestingly GCC merges the allocations, still allocates 120 bytes, but does not use that extra space to align the array on a 16 byte boundary. I'd be glad if anyone could explain that behaviour. – chrysante May 19 '23 at 21:23

You should assume that the storage for every object declared anywhere in the function is allocated at once, on entry into the function. Decent C compilers merge the storage for objects that have non-overlapping lifetimes.

If you have a large object in some particular code path that you would like to avoid allocating when that path is not taken, you have to do one of these:

  • Allocate it dynamically and then free it, in that code path.

  • Allocate it using a C99 variable-length array (not part of standard C++, though GCC and Clang accept it there as an extension).

  • Allocate it using the alloca function/operator that exists as a traditional extension in many compilers.

  • Move the code into a separate helper function. However, if that function is inlined, this will not make a difference! The stack allocations coming from inlined functions are incorporated into one big stack frame, as if the code were written inline. Be sure to use the compiler-specific magic to declare that this function must not be inlined.

#ifdef __GNUC__
#define NOINLINE __attribute__((noinline))
#else
#error port me
#endif

NOINLINE void foo_SuperLargeObjectCase()
{
   SuperLargeObject y;
}

void foo() {
   if (runtimeCondition) {
       int x = 0;
   } else {
       foo_SuperLargeObjectCase();
   }
}

I have used the last approach above in the TXR Lisp virtual machine. The VM dispatches functions for various instructions. Some of the functions have more stack storage, some less. I have declared many of those functions noinline, and this made a huge difference to the observed stack frame size.

Of course, moving the code into a function may be inconvenient; you may have to pass down all the arguments, and even some additional ones if the code needs access to some local variables of foo.
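When a helper function is awkward, the first option in the list (dynamic allocation) may be simpler. Here is a minimal C++14 sketch of it, assuming a made-up 128 KiB `SuperLargeObject` and a hypothetical `use` function that stands in for whatever the real code does with the object; `foo` returns a value only so that both paths can be checked.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the question's SuperLargeObject (size is arbitrary).
struct SuperLargeObject {
    char data[128 * 1024];
};

// Hypothetical consumer, so that the object is actually used.
std::size_t use(SuperLargeObject& obj) {
    obj.data[0] = 1;
    return sizeof obj;
}

std::size_t foo(bool runtimeCondition) {
    if (runtimeCondition) {
        int x = 0;  // cheap case: a register or a few bytes of stack
        return static_cast<std::size_t>(x);
    }
    // Expensive case: the buffer lives on the heap, only the pointer is a local.
    // The unique_ptr frees the buffer automatically when foo returns.
    auto y = std::make_unique<SuperLargeObject>();
    return use(*y);
}

// Small self-check that exercises both paths.
void demo() {
    assert(foo(true) == 0);
    assert(foo(false) == sizeof(SuperLargeObject));
}
```

The trade-off is a heap allocation on the expensive path, in exchange for a stack frame that no longer has to reserve room for the large object on every call.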

If you're concerned about the stack usage of functions, gcc has useful diagnostics for that. You can use -fstack-usage to obtain information about the stack usage of functions, and/or the -Wstack-usage=N warning which is issued if the stack usage of some function exceeds N bytes.
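As a rough sketch of how these options might be tried out, the file below can be compiled with the command shown in its leading comment. The file name, the threshold of 64 bytes and the in-file definition of `escape` are assumptions made for illustration; the exact numbers reported depend on the compiler version and optimization level.

```cpp
// stack_demo.cpp (hypothetical file name)
//
// Assumed GCC invocation:
//   g++ -O2 -c -fstack-usage -Wstack-usage=64 stack_demo.cpp
// -fstack-usage writes a per-function stack usage report to a .su file.
// -Wstack-usage=64 warns about functions whose stack usage exceeds 64 bytes
// (the threshold is arbitrary, picked to be smaller than SuperLargeObject).
#include <cassert>

struct SuperLargeObject {
    char data[104];
};

// Publishing the address through a volatile pointer is meant to keep the
// locals from being optimized away; the stored pointer is never read back.
void const* volatile g_last_escaped = nullptr;
int g_escape_calls = 0;

void escape(void const* p) {
    g_last_escaped = p;
    ++g_escape_calls;
}

void f(bool cond) {
    if (cond) {
        int x;
        escape(&x);
    } else {
        SuperLargeObject y;
        escape(&y);
    }
}

// Runs both branches once and checks that escape was reached each time.
void demo() {
    f(true);
    f(false);
    assert(g_escape_calls == 2);
}
```

The report should contain one line per function, which makes it easy to spot a frame that is much larger than expected.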

True story: -fstack-usage helped me discover that a function which uses the crypt_r function in the GNU C library had a stack frame over 128 kilobytes. That's how large the struct crypt_data context buffer for crypt_r is! I switched that code to malloc and free the structure.

Kaz