Alloca implementation

Question

How does one implement alloca() using inline x86 assembler in languages like D, C, and C++? I want to create a slightly modified version of it, but first I need to know how the standard version is implemented. Reading the disassembly from compilers doesn't help because they perform so many optimizations, and I just want the canonical form.

Edit: I guess the hard part is that I want this to have normal function call syntax, i.e. using a naked function or something, make it look like the normal alloca().

Edit # 2: Ah, what the heck, you can assume that we're not omitting the frame pointer.

I think much of the confusion in this regard comes from the fact that `alloca` was first introduced in the very early days of C. Compilers were so naive that they didn't really keep track of the stack and did do everything via frame pointers. So you really could implement `alloca` without compiler cooperation. For instance [here](https://www.tuhs.org/cgi-bin/utree.pl?file=32V/usr/src/libc/sys/alloca.s) is the source code for `alloca` in V32 Unix; just seven lines of VAX assembly that could be called as a normal function call. — Nate Eldredge, Dec 05 '22 at 15:58
And so the idea that "alloca can be implemented independent of the compiler" seems to have stuck. But it's just not true with modern compilers. They *do* keep track of the stack. Even when the frame pointer is not omitted, the compiler may not be using the stack pointer explicitly on every single stack access, but it still knows where it it supposed to be, and certain optimizations may take advantage of that. If you change the stack pointer behind its back, things *will* break. "Maybe not today, maybe not tomorrow, but soon, and for the rest of your life." — Nate Eldredge, Dec 05 '22 at 16:01

Evan Teran · Accepted Answer · 2016-11-19T14:46:19.617

implementing alloca actually requires compiler assistance. A few people here are saying it's as easy as:

sub esp, <size>

which is unfortunately only half of the picture. Yes that would "allocate space on the stack" but there are a couple of gotchas.

if the compiler had emitted code which references other variables relative to esp instead of ebp (typical if you compile with no frame pointer). Then those references need to be adjusted. Even with frame pointers, compilers do this sometimes.
more importantly, by definition, space allocated with alloca must be "freed" when the function exits.

The big one is point #2. Because you need the compiler to emit code to symmetrically add <size> to esp at every exit point of the function.

The most likely case is the compiler offers some intrinsics which allow library writers to ask the compiler for the help needed.

EDIT:

In fact, in glibc (GNU's implementation of libc). The implementation of alloca is simply this:

#ifdef  __GNUC__
# define __alloca(size) __builtin_alloca (size)
#endif /* GCC.  */

EDIT:

after thinking about it, the minimum I believe would be required would be for the compiler to always use a frame pointer in any functions which uses alloca, regardless of optimization settings. This would allow all locals to be referenced through ebp safely and the frame cleanup would be handled by restoring the frame pointer to esp.

EDIT:

So i did some experimenting with things like this:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>

#define __alloca(p, N) \
    do { \
        __asm__ __volatile__( \
        "sub %1, %%esp \n" \
        "mov %%esp, %0  \n" \
         : "=m"(p) \
         : "i"(N) \
         : "esp"); \
    } while(0)

int func() {
    char *p;
    __alloca(p, 100);
    memset(p, 0, 100);
    strcpy(p, "hello world\n");
    printf("%s\n", p);
}

int main() {
    func();
}

which unfortunately does not work correctly. After analyzing the assembly output by gcc. It appears that optimizations get in the way. The problem seems to be that since the compiler's optimizer is entirely unaware of my inline assembly, it has a habit of doing the things in an unexpected order and still referencing things via esp.

Here's the resultant ASM:

8048454: push   ebp
8048455: mov    ebp,esp
8048457: sub    esp,0x28
804845a: sub    esp,0x64                      ; <- this and the line below are our "alloc"
804845d: mov    DWORD PTR [ebp-0x4],esp
8048460: mov    eax,DWORD PTR [ebp-0x4]
8048463: mov    DWORD PTR [esp+0x8],0x64      ; <- whoops! compiler still referencing via esp
804846b: mov    DWORD PTR [esp+0x4],0x0       ; <- whoops! compiler still referencing via esp
8048473: mov    DWORD PTR [esp],eax           ; <- whoops! compiler still referencing via esp           
8048476: call   8048338 <memset@plt>
804847b: mov    eax,DWORD PTR [ebp-0x4]
804847e: mov    DWORD PTR [esp+0x8],0xd       ; <- whoops! compiler still referencing via esp
8048486: mov    DWORD PTR [esp+0x4],0x80485a8 ; <- whoops! compiler still referencing via esp
804848e: mov    DWORD PTR [esp],eax           ; <- whoops! compiler still referencing via esp
8048491: call   8048358 <memcpy@plt>
8048496: mov    eax,DWORD PTR [ebp-0x4]
8048499: mov    DWORD PTR [esp],eax           ; <- whoops! compiler still referencing via esp
804849c: call   8048368 <puts@plt>
80484a1: leave
80484a2: ret

As you can see, it isn't so simple. Unfortunately, I stand by my original assertion that you need compiler assistance.

I think you're ok there; the ESP accesses are writing args before function calls, and ESP-relative is correct. You could try `-fno-accumulate-outgoing-args` or whatever it and related args are to get gcc to just use PUSH instead of using MOV to modify the bottom of the stack. — Peter Cordes, Nov 19 '16 at 14:59
But really, trying to implement alloca behind the compiler's back it a *terrible* idea, like you point out in the early part of this excellent answer. So many ways for it to go wrong, and no reason to do so. If people want to write asm and do their own stack allocation, just write in pure asm instead of abusing inline-asm in C++. — Peter Cordes, Nov 19 '16 at 15:01
@PeterCordes true that most of the ESP references are function arguments, but because it tried to pre-allocate the space **before** the "alloca", those moves will trample on the user's "allocated space". Which is broken if I intend to use that space. Changing those to proper pushes would fix most of that. Also the last esp reference is storing a result in a local variable, and once again will trample on the "array". It goes badly quite quickly. — Evan Teran, Nov 19 '16 at 15:18
Oh good point, yeah I forgot about who owned which space. But `DWORD PTR [esp],eax` is writing an arg for `puts`; I don't see an ESP-relative access to a local. Anyway, I think we agree that the conclusion here is "maybe possible under controlled conditions with a bunch of gcc options that usually hurt performance; totally not worth it and a bad idea". Especially since in x86-64 code, there's no way to tell the compiler you want to clobber the red zone, so this isn't portable at all to x86-64. — Peter Cordes, Nov 19 '16 at 15:34
@PeterCordes, agreed, and good call on the last `DWORD PTR [esp],eax` I mis-read that, it is in fact just setting up an arg for the `puts`. — Evan Teran, Nov 20 '16 at 05:23

score 7 · Answer 2 · answered Apr 03 '09 at 16:54

It would be tricky to do this - in fact, unless you have enough control over the compiler's code generation it cannot be done entirely safely. Your routine would have to manipulate the stack, such that when it returned everything was cleaned, but the stack pointer remained in such a position that the block of memory remained in that place.

The problem is that unless you can inform the compiler that the stack pointer is has been modified across your function call, it may well decide that it can continue to refer to other locals (or whatever) through the stack pointer - but the offsets will be incorrect.

score 5 · Answer 3 · answered Apr 04 '09 at 02:14

5

For the D programming language, the source code for alloca() comes with the download. How it works is fairly well commented. For dmd1, it's in /dmd/src/phobos/internal/alloca.d. For dmd2, it's in /dmd/src/druntime/src/compiler/dmd/alloca.d.

answered Apr 04 '09 at 02:14

Walter Bright

4,277
1
23
28

Well, I guess that pretty much answers it. It says right in the comments that it's a magic function and requires compiler support, i.e. I can't do exactly what I wanted. Maybe I'll figure out a way to do it with the existing alloca() and mixins instead. – dsimcha Apr 04 '09 at 02:55

bk1e · Answer 4 · 2009-04-04T20:05:15.897

The C and C++ standards don't specify that alloca() has to the use the stack, because alloca() isn't in the C or C++ standards (or POSIX for that matter)¹.

A compiler may also implement alloca() using the heap. For example, the ARM RealView (RVCT) compiler's alloca() uses malloc() to allocate the buffer (referenced on their website here), and also causes the compiler to emit code that frees the buffer when the function returns. This doesn't require playing with the stack pointer, but still requires compiler support.

Microsoft Visual C++ has a _malloca() function that uses the heap if there isn't enough room on the stack, but it requires the caller to use _freea(), unlike _alloca(), which does not need/want explicit freeing.

(With C++ destructors at your disposal, you can obviously do the cleanup without compiler support, but you can't declare local variables inside an arbitrary expression so I don't think you could write an alloca() macro that uses RAII. Then again, apparently you can't use alloca() in some expressions (like function parameters) anyway.)

¹ Yes, it's legal to write an alloca() that simply calls system("/usr/games/nethack").

score 4 · Answer 5 · answered Apr 20 '13 at 22:10

Continuation Passing Style Alloca

Variable-Length Array in pure ISO C++. Proof-of-Concept implementation.

Usage

void foo(unsigned n)
{
    cps_alloca<Payload>(n,[](Payload *first,Payload *last)
    {
        fill(first,last,something);
    });
}

Core Idea

template<typename T,unsigned N,typename F>
auto cps_alloca_static(F &&f) -> decltype(f(nullptr,nullptr))
{
    T data[N];
    return f(&data[0],&data[0]+N);
}

template<typename T,typename F>
auto cps_alloca_dynamic(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
    vector<T> data(n);
    return f(&data[0],&data[0]+n);
}

template<typename T,typename F>
auto cps_alloca(unsigned n,F &&f) -> decltype(f(nullptr,nullptr))
{
    switch(n)
    {
        case 1: return cps_alloca_static<T,1>(f);
        case 2: return cps_alloca_static<T,2>(f);
        case 3: return cps_alloca_static<T,3>(f);
        case 4: return cps_alloca_static<T,4>(f);
        case 0: return f(nullptr,nullptr);
        default: return cps_alloca_dynamic<T>(n,f);
    }; // mpl::for_each / array / index pack / recursive bsearch / etc variacion
}

LIVE DEMO

cps_alloca on github

score 3 · Answer 6 · edited Apr 03 '09 at 18:36

alloca is directly implemented in assembly code. That's because you cannot control stack layout directly from high level languages.

Also note that most implementation will perform some additional optimization like aligning the stack for performance reasons. The standard way of allocating stack space on X86 looks like this:

sub esp, XXX

Whereas XXX is the number of bytes to allcoate

Edit:
If you want to look at the implementation (and you're using MSVC) see alloca16.asm and chkstk.asm.
The code in the first file basically aligns the desired allocation size to a 16 byte boundary. Code in the 2nd file actually walks all pages which would belong to the new stack area and touches them. This will possibly trigger PAGE_GAURD exceptions which are used by the OS to grow the stack.

technosaurus · Answer 7 · 2015-07-04T09:30:55.033

1

If you can't use c99's Variable Length Arrays, you can use a compound literal cast to a void pointer.

#define ALLOCA(sz) ((void*)((char[sz]){0}))

This also works for -ansi (as a gcc extension) and even when it is a function argument;

some_func(&useful_return, ALLOCA(sizeof(struct useless_return)));

The downside is that when compiled as c++, g++>4.6 will give you an error: taking address of temporary array ... clang and icc don't complain though

edited Jul 04 '15 at 09:30

answered Jul 04 '15 at 08:18

technosaurus

7,676
1
30
52

I feel as though it should be noted that the reason C++, G++ both give errors is simply because it is erratic in C++. C99 has VLA's, C++ does not. I don't feel that you totally omitted this, but calling it out specifically would improve the quality of the answer. – Joshua Hedges Sep 18 '18 at 17:45
That won't work for two reasons: If `sz` is dynamic `(char[sz]){0}` isn't valid C. gcc/clang won't let you have variable-sized compound literals. The second reason is aliasing. A char array has a declared type. You can't make it behave like allocated memory, which has no declared type. – Petr Skocik Feb 05 '19 at 16:00
I believe scoping would be different for VLAs too (more restrictive) – Arran Cudbard-Bell Dec 16 '20 at 03:29

score 1 · Answer 8 · answered Apr 04 '09 at 13:27

1

You can examine sources of an open-source C compiler, like Open Watcom, and find it yourself

answered Apr 04 '09 at 13:27

dmityugov

4,390
23
18

StandBone · Answer 9 · 2018-04-06T12:44:25.393

-1

What we want to do is something like that:

void* alloca(size_t size) {
    <sp> -= size;
    return <sp>;
}

In Assembly (Visual Studio 2017, 64bit) it looks like:

;alloca.asm

_TEXT SEGMENT
    PUBLIC alloca
    alloca PROC
        sub rsp, rcx ;<sp> -= size
        mov rax, rsp ;return <sp>;
        ret
    alloca ENDP
_TEXT ENDS

END

Unfortunately our return pointer is the last item on the stack, and we do not want to overwrite it. Additionally we need to take care for the alignment, ie. round size up to multiple of 8. So we have to do this:

;alloca.asm

_TEXT SEGMENT
    PUBLIC alloca
    alloca PROC
        ;round up to multiple of 8
        mov rax, rcx
        mov rbx, 8
        xor rdx, rdx
        div rbx
        sub rbx, rdx
        mov rax, rbx
        mov rbx, 8
        xor rdx, rdx
        div rbx
        add rcx, rdx

        ;increase stack pointer
        pop rbx
        sub rsp, rcx
        mov rax, rsp
        push rbx
        ret
    alloca ENDP
_TEXT ENDS

END

edited Apr 06 '18 at 12:44

answered Mar 19 '18 at 14:44

StandBone

9
4

Returning with a modified RSP violates the caller's assumptions. Also, you don't need a very slow `div` instruction to round up to a multiple of 8, let alone two of them. (Also, Windows x64 requires / guarantees 16-byte stack alignment) `pop rax` (grab the return address) / `sub rsp, rcx` / `and rsp, -16` (round RSP down, rounding the allocation size up) / `push rax` / `ret` would fix that, but still have the showstopper bug of violating the calling convention, breaking the caller's RSP-relative access to its stack vars after return. – Peter Cordes May 27 '22 at 06:41
https://godbolt.org/z/Y4nhvaWv9 shows gcc/MSVC callers (with optimization enabled) that would be broken by this, as they restore their own RSP to point at their return address with `add rsp, 32`, not `mov rsp, rbp`. `gcc -fomit-frame-pointer` has been enabled for decades on x86-64. Also shows GCC's over-complicated alloca code, and MSVC's even more complicated `_alloca` code. And GCC's C99 VLA code, a bit less bad. – Peter Cordes May 27 '22 at 06:46

Ana Betts · Answer 10 · 2009-04-03T23:00:41.897

-1

Alloca is easy, you just move the stack pointer up; then generate all the read/writes to point to this new block

sub esp, 4

edited Apr 03 '09 at 23:00

answered Apr 03 '09 at 16:30

Ana Betts

73,868
16
141
209

1

1) it's not esi 2) stack grows from high to low adresses – newgre Apr 03 '09 at 16:40

user3377317 · Answer 11 · 2022-05-27T06:32:12.087

-1

my_alloca:  ; void *my_alloca(int size);
    MOV     EAX, [ESP+4]    ; get size
    ADD     EAX,-4          ; include return address as stack space(4bytes)
    SUB     ESP,EAX
    JMP     DWORD [ESP+EAX]     ; replace RET(do not pop return address)

edited May 27 '22 at 06:32

answered May 27 '22 at 03:49

user3377317

1
1

`[ESP-4]` is *below* the return address. In a legacy stack-args calling convention, the first arg is at `[esp+4]` on function entry. Also, this "function" violates the calling convention by returning with a modified ESP, so it can only work in code compiled with `gcc -fno-omit-frame-pointer` or equivalent. I realize this function's doing it on purpose as a way to allocate space behind the compiler's back, but it's not usable. `-fomit-frame-pointer` is enabled at `gcc -O1`, and MSVC. Also it doesn't maintain 16-byte stack alignment like the Linux version of the i386 SysV ABI requires. – Peter Cordes May 27 '22 at 04:41
https://godbolt.org/z/9EebhncEE shows GCC and MSVC for 32-bit Linux and Windows respectively making asm (with `-O2` optimization enabled) that calls this function and does *not* use `leave` afterwards. e.g. MSVC's epilogue is `add esp, 12`/ `pop esi` / `ret`, which will fail to find the correct return address if some callee has moved ESP. If your code is only supposed to work for a more limited type of caller, your answer needs to say so. – Peter Cordes May 27 '22 at 04:46
This could be an ok answer if it discussed why this *isn't* safe, and only happens to work in some cases. But Evan Teran's answer pretty much does that already. – Peter Cordes May 27 '22 at 06:50
Sorry for my bad idea. This code not considering function return value(assign address to EAX) too. This code is only an implementation example cooperating with your own compiler. As you saying that the compiler not balance the ESP from caller(thats the rule of C call), so the alloca implementation only can do itself in compiler. – user3377317 May 27 '22 at 07:10
If this is a compiler built-in like it has to be, why make a function-call at all? https://godbolt.org/z/ehbvP8G87 shows GCC/MSVC's `alloca` / `_alloc` (and C99 VLA for GCC, but MSVC isn't a C99 compiler.) They're over-complicated for simple cases like this (although GCC has to maintain 16-byte stack alignment), but at least they don't cause branch mispredicts by unbalancing the return-address predictor with a `jmp` instead of a `push`/`ret`. Anyway, you could edit your answer to explain what it's for and what it does/doesn't do, or if you want you could delete it. – Peter Cordes May 27 '22 at 07:23

score -2 · Answer 12 · answered Apr 03 '09 at 16:34

-2

I recommend the "enter" instruction. Available on 286 and newer processors (may have been available on the 186 as well, I can't remember offhand, but those weren't widely available anyways).

answered Apr 03 '09 at 16:34

Brian Knoblauch

20,639
15
57
92

unfortunately, the enter instruction is fairly useless this purpose (implementing alloca in a higher level language) simply because you wouldn't get enough compiler cooperation. – Evan Teran Apr 03 '09 at 21:45
1

You definitely don't want [ENTER](http://www.felixcloutier.com/x86/ENTER.html) in inline-asm, because it overwrites EBP so the compiler won't know where its locals are. It's also extremely slow on modern CPUs, which is why compilers use `push ebp/mov ebp,esp/sub esp, N`. So really you never want ENTER, even if writing a stand-alone function in asm. – Peter Cordes Nov 19 '16 at 21:36

Alloca implementation

12 Answers12

Continuation Passing Style Alloca

Usage

Core Idea

Linked