How did alloca() interact with other stack allocation?

Question

Let's start from a simple example of stack allocation:

void f() {
    int a, b;
    ...
}

If I understand correctly. Then address of a and b has a fixed offset from stack base, namely the register ebp. That's the how compiler find them if we need them in afterwards.

But consider the following code.

void f(int n) {
    int a;
    alloca(n);
    int b;
    ...
}

If compiler do not do any optimization, the stack would be a->n->b. Now offset of b is dependent on n. Then what did compiler do?

Mimicking How does alloca() work on a memory level? . I tried the following code:

#include <stdio.h>
#include <alloca.h>

void foo(int n)
{
    int a;
    int *b = alloca(n * sizeof(int));
    int c;
    printf("&a=%p, b=%p, &c=%p\n", (void *)&a, (void *)b, (void *)&c);
}

int main()
{
    foo(5);
    return 0;
}

The output is &a=0x7fffbab59d68, b=0x7fffbab59d30, &c=0x7fffbab59d6c. This time address of a and c looks neighboring. Did compiler do some reordering? And if we do not allow compiler to reorder how would compiler find address of c?

------------Some Update------------

Alright, as long as you believe compilers really matters, let's try x86-64 gcc 13.2, I've modify the code a little bit.

#include <alloca.h>
void alloca_test(int n) {
    int a;
    int* ptr = (int *) alloca(n);
    int b;
    a++;
    b++;
    ptr[0]++;
}

and here's the assembly:

alloca_test(int):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 48
        mov     DWORD PTR [rbp-36], edi
        mov     DWORD PTR [rbp-4], 0
        mov     eax, DWORD PTR [rbp-36]
        cdqe
        lea     rdx, [rax+8]
        mov     eax, 16
        sub     rax, 1
        add     rax, rdx
        mov     ecx, 16
        mov     edx, 0
        div     rcx
        imul    rax, rax, 16
        sub     rsp, rax
        mov     rax, rsp
        add     rax, 15
        shr     rax, 4
        sal     rax, 4
        mov     QWORD PTR [rbp-16], rax
        mov     DWORD PTR [rbp-20], 0
        add     DWORD PTR [rbp-4], 1    <--a++
        add     DWORD PTR [rbp-20], 1   <--b++
        mov     rax, QWORD PTR [rbp-16]
        mov     eax, DWORD PTR [rax]
        lea     edx, [rax+1]
        mov     rax, QWORD PTR [rbp-16]
        mov     DWORD PTR [rax], edx
        nop
        leave
        ret

Here b has address [rbp-20], which don't change w.r.t n

A side note: [Why is the use of alloca() not considered good practice?](https://stackoverflow.com/questions/1018853/why-is-the-use-of-alloca-not-considered-good-practice) — wohlstad, Aug 13 '23 at 06:43
The C standard doesn't specify such things. The C standard doesn't even mention the use of a (call) stack. All this depends on the implementation being used (i.e. platform, compiler, compiler options, etc.). So without very specifc information about the system being used, we can't help you. You need to edit the question and add that information. — Support Ukraine, Aug 13 '23 at 06:50
"If compiler do not do any optimization, the stack would be .... " No, the C standard doesn't specify such things. — Support Ukraine, Aug 13 '23 at 06:53
Why do you think that, on systems that use a stack for local variables, they're pushed onto it in the same order they appear in the original source? — Shawn, Aug 13 '23 at 08:36
@Shawn Yes, that's the point. At first sight that looks a natural assumption. Now I've tried to understand the assemblies and looks like the assumption do not holds, compiler scan the code and always first push those 'local variables with fixed size'. But it might introduce more problems, say consider ```if(some_condition){int c=0;...}``` . Then as ```c``` is under those allocated by ```alloca()```, it's not freed after lifetime. — Peter Wu, Aug 13 '23 at 08:49
@PeterWu - Compilers generally allocate all the variables at once, so space for `c` can also be "preallocated" but left unused if not needed. Again, there are no language rules for how to use a stack. Some CPUs don't even have a dedicated stack pointer, but still runs C. — BoP, Aug 13 '23 at 08:59
Note that GCC's debug-mode asm is way overcomplicated, using div and imul instead of `and` with `-16` to round the allocation to a multiple of 16 to maintain stack alignment. See [gcc uses DIV and IMUL by a constant 16, and shifts, for alloca?](https://stackoverflow.com/q/66289556) . If you write a function that won't optimize away an alloca (e.g. pass the pointer to another function, or use `volatile int *ptr = ...` to also force stack space for locals), you can see the much simpler asm from `gcc -O2`. *[How to remove "noise" from GCC asm?](https://stackoverflow.com/q/38552116)* — Peter Cordes, Aug 13 '23 at 17:16

score 6 · Accepted Answer · edited Aug 13 '23 at 17:11

Actually, for a really non-optimizing compiler, it would be the other way around.

Let's imagine a naive, 1970s-tech compiler, since the C language was originally designed to be implemented by compilers like that. And let's ignore the alloca call for the moment. As the compiler parses the function, each time it encounters a definition of a local variable [*], it assigns it a stack slot, relative to the frame pointer ebp: so a at [ebp-4], b at [ebp-8] and so on, and uses that to address the variable. When it finishes parsing, it has seen all the local variables and can compute the total amount of stack needed, and so insert the appropriate constant in the prologue code that adjusts the stack pointer (e.g. sub esp, 8). Even a very simple one-pass compiler that emits code line by line can do this, by emitting something like sub esp, 0 in the prologue and then back-patching later to replace the immediate 0 by the correct value.

Now, as for alloca, it wasn't part of the original C language. Rather, it was basically a "cool hack" that someone discovered, that can be implemented in a way that is completely orthogonal to the process above. We can describe the idea as follows in x86 terms. (The original implementation would have been for PDP-11 or VAX or something like that, but the idea is the same.) Since all the locals are addressed relative to ebp, then it doesn't matter if the stack pointer esp is decremented further during the execution of the function; the compiler never refers to esp. And the stack cleanup in the function epilogue is normally implemented as mov esp, ebp instead of add esp, 8, so it will also continue to work fine.

So in fact the compiler doesn't even have to know that alloca(n) does anything special with regard to the stack. It can be a macro that expands to inline assembly like sub esp, [ebp+8] / mov eax, esp, which, again, can be opaque to the compiler other than filling in an addressing mode for n (in this case the first stack arg, at [ebp+8] assuming traditional use of EBP as a frame pointer). It would play no role at all in the process of allocating stack slots for local variables, since this hypothetical compiler will only ever access locals relative to a frame pointer (EBP), not ESP.

In fact, if you are willing to do a little more stack juggling, alloca can even be an external library function, that the compiler just treats like any other function call. This is why alloca has the syntax of a function call, rather than being integrated more deeply into the language - the original implementation was simply a function call.

So with this implementation, we'd have a and b at the top of the stack frame (allocated simultaneously in the function prologue), and the allocaed buffer below them (allocated at the point of the alloca call). If we had more calls to alloca, then they would return progressively lower addresses in the order they were executed, subtracting from the stack pointer each time.

An alloca that the compiler didn't understand could break things if done in the middle of pushing args for a call to another function, so presumably bar(1, 2, alloca(n), 3) wasn't safe with early hacky implementations.

Now as you might imagine, this "cool hack" didn't work so well once compilers got smarter and wanted to be able to actually control the stack pointer throughout the execution of the function. Then changing the stack pointer behind the compiler's back, as it were, becomes disastrous, so alloca had to be given special compiler support, causing a lot of pain for compiler writers. The idea of alloca was eventually fully adopted into the language when C99 introduced variable-length arrays, but many observers regard that decision as a mistake.

So as for what a modern compiler does, there's not necessarily any single answer. It knows exactly what semantics alloca needs to provide, and it can make its own decisions as to how to implement that. It's not necessarily constrained to using ebp-relative addressing for all locals; it can use esp-relative, or compute the address some other way, or maybe just optimize variables into registers so that they don't occupy a stack slot at all. So we can't easily predict what the stack layout will look like.

[*] Note, in passing, that in assigning stack slots and computing stack usage, the block structure of the function would typically be flattened, and the scopes of the variables would not be taken into account. So even in code like

int a;
if (...) {
    int b;
    ...
}
while (...) {
    int c;
    ...
}

Here a,b,c will each get their own stack slots, say [ebp-4], [ebp-8], [ebp-12] respectively. The stack pointer will be adjusted by 12 in the prologue. In particular, the compiler will not emit instructions to readjust the stack pointer at the beginning and end of each { } blocks. It could, however, optimize by noticing that the scopes of b and c do not overlap, and so it could assign [ebp-8] to both of them, and use only 8 bytes of stack instead of 12.

How did alloca() interact with other stack allocation?

1 Answers1