Unable to understand example of cdecl calling convention where caller doesn't need to clean the stack

Question

I am reading the IDA Pro Book. On page 86 while discussing calling conventions, the author shows an example of cdecl calling convention that eliminates the need for the caller to clean arguments off the stack. I am reproducing the code snippet below:

; demo_cdecl(1, 2, 3, 4); //programmer calls demo_cdecl
mov [esp+12], 4 ; move parameter z to fourth position on stack
mov [esp+8], 3 ; move parameter y to third position on stack
mov [esp+4], 2 ; move parameter x to second position on stack
mov [esp], 1 ; move parameter w to top of stack
call demo_cdecl ; call the function

The author goes on to say that

in the above example, the compiler has preallocated storage space for the arguments to demo_cdecl at the top of the stack during the function prologue.

I am going to assume that there is a sub esp, 0x10 at the top of the code snippet. Otherwise, you would just be corrupting the stack.

He later says that the caller doesn't need to adjust the stack when call to demo_cdecl completes. But surely, there has to be a add esp, 0x10 after the call.

What exactly am I missing?

If the call completes with `ret 0x10`, the `ret` instruction will adjust the `esp` register. So check the subroutine's machine code to see which kind of `ret` it uses. EDIT: Or if the book doesn't show the subroutine's code, pay attention to the calling convention's definition; it may be defined to use [`ret imm16`](http://www.felixcloutier.com/x86/RET.html) to adjust the stack in the subroutine. I can't recall exactly which platform uses this one (some Windows?), but I really don't like it. Luckily the calling convention on Linux is different, so I don't care. — Ped7g, Mar 27 '18 at 13:12
The caller still needs to clear the stack, it's just not done immediately, rather, in the function epilogue. This style effectively merges the arguments into the local variables. The very fact that it's `cdecl` means caller clean up. Callee clean up is called `stdcall` or `pascal` (depending on order). — Jester, Mar 27 '18 at 13:33
@Ped7g: it's `cdecl`, so the callee ends with a normal `ret`. That *is* the 32-bit Linux calling convention. The code was probably compiled by gcc with [`-maccumulate-outgoing-args`](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-maccumulate-outgoing-args-1), which defers clearing the stack until the end of the function and avoids using `push` even for the initial growth. This used to be a good thing, but is [now just a waste of code-size and instructions.](https://stackoverflow.com/questions/49485395/what-c-c-compiler-can-use-push-pop-instructions-for-creating-local-variables/49503043#49503043) — Peter Cordes, Mar 27 '18 at 16:25
@PeterCordes thanks for the correction, I don't remember the calling conventions by name. Reading my comment again, I did write it in a somewhat ambiguous way because I was not sure, so it's not plainly incorrect, but it's still nice that you corrected it definitively. — Ped7g, Mar 27 '18 at 19:07
Mind that, in your example, the values are *not* pushed, but mov'ed to the stack without changing the stack pointer. This way there's no need to sub anything. — Tommylee2k, Mar 28 '18 at 06:14

Hadi Brais · Answer 1 · 2018-03-27T16:47:06.563

I am going to assume that there is a sub esp, 0x10 at the top of the code snippet. Otherwise, you would just be corrupting the stack.

The parameters are stored at addresses that are positive offsets from the stack pointer. Remember that the stack grows downwards. This means that the space required to hold these parameters has already been allocated (probably by the caller's prologue code). That's why there is no need for sub esp, N for each call sequence.

He later says that the caller doesn't need to adjust the stack when call to demo_cdecl completes. But surely, there has to be a add esp, 0x10 after the call.

In the cdecl calling convention, the caller always has to clean up the stack one way or another. If allocation was done by the caller's prologue, it will be deallocated by the epilogue (together with the caller's local variables). Otherwise, if the parameters of the callee were allocated somewhere in the middle of the caller's code, then the easiest way to clean up is by using add esp, N right after the call instruction.

There is a trade-off involved between these two different implementations of the cdecl calling convention. Allocating parameters in the prologue means that the largest space required by any callee must be allocated. It will be reused for each callee. Then at the end of the caller, it will be cleaned up once. So this may unnecessarily waste stack space, but it may improve performance. In the other technique, the caller only allocates space for parameters when the associated call site is actually going to be reached. Cleanup is then performed right after the callee returns. So no stack space is wasted. But allocation and cleanup have to be performed at each call site in the caller. You can also imagine an implementation that is in between these two extremes.

Re: your first paragraph: yes, that's what the OP meant by "corrupting the stack". If you hadn't done `sub esp, 0x10`, you'd be scribbling over your return address and args, or the caller's stack memory. I don't think the OP was worried that `call` is going to clobber the args you just wrote. — Peter Cordes, Mar 27 '18 at 16:29
@PeterCordes I'm not sure I understand. Did I make a mistake in the first paragraph? I said that space on the stack has already been allocated in the prologue of the caller to hold the parameters of the callee. That's why the callee won't corrupt the stack: the stack pointer already points to the top of the stack. That's what I'm trying to say in the first paragraph. — Hadi Brais, Mar 27 '18 at 16:37
The OP is saying the function needs to have already allocated space to store these args (not overlapping anything else) with something like `sub esp, 0x10`. You're saying that `sub esp, 0x10` is unnecessary, *and* that it has to have already happened. >.< I think you're thinking of this one call as part of a larger function, and you mean that each call sequence doesn't need its own `sub esp, N`, just a single `sub esp, max_N` at the top. That's what the OP means, too, I think. You're assuming the wrong misunderstanding on the part of the OP, I think. — Peter Cordes, Mar 27 '18 at 16:40
@PeterCordes Yeah I didn't mean that. I meant that `sub esp, N` is required only once in the prologue of the caller; there is no need to allocate stack space at every call site. So yes, each call sequence doesn't need its own `sub esp, N`. That's what I meant. I'll clarify that part of my answer. — Hadi Brais, Mar 27 '18 at 16:44

score 1 · Answer 2 · answered Mar 27 '18 at 17:07

Compilers often choose mov to store args instead of push, if there's enough space already allocated (e.g. with a sub esp, 0x10 earlier in the function like you suggested).

Here's an example:

int f1(int);
int f2(int,int);

int foo(int a) {
    f1(2);
    f2(3,4);

    return f1(a);
}

compiled by clang6.0 -O3 -march=haswell on Godbolt

    sub     esp, 12                # reserve space to realign stack by 16
    mov     dword ptr [esp], 2     # store arg
    call    f1(int)
                    # reuse the same arg-passing space for the next function
    mov     dword ptr [esp + 4], 4  
    mov     dword ptr [esp], 3
    call    f2(int, int)
    add     esp, 12
                    # now ESP is pointing to our own arg
    jmp     f1(int)                  # TAILCALL

clang's code-gen would have been even better with sub esp,8 / push 2, with the rest of the function unchanged. i.e. let push grow the stack, because it has smaller code-size than mov, especially mov-immediate, and performance is not worse (because we're about to call, which also uses the stack engine). See What C/C++ compiler can use push pop instructions for creating local variables, instead of just increasing esp once? for more details.

I also included in the Godbolt link GCC output with and without -maccumulate-outgoing-args, which defers clearing the stack until the end of the function.

By default (without accumulate outgoing args) gcc does let ESP bounce around, and even uses 2x pop to clear 2 args from the stack. (Avoiding a stack-sync uop, at the cost of 2 useless loads that hit in L1d cache). With 3 or more args to clear, gcc uses add esp, 4*N. I suspect that reusing the arg-passing space with mov stores instead of add esp / push would be a win sometimes for overall performance, especially with registers instead of immediates. (push imm8 is much more compact than mov imm32.)

foo(int):            # gcc7.3 -O3 -m32   output
    push    ebx
    sub     esp, 20
    mov     ebx, DWORD PTR [esp+28]    # load the arg even though we never need it in a register
    push    2                          # first function arg
    call    f1(int)
    pop     eax
    pop     edx                        # clear the stack
    push    4
    push    3                          # and write the next two args
    call    f2(int, int)
    mov     DWORD PTR [esp+32], ebx    # store `a` back where it already was
    add     esp, 24
    pop     ebx
    jmp     f1(int)                    # and tailcall

With -maccumulate-outgoing-args, the output is basically like clang's, but gcc still saves/restores ebx and keeps a in it before doing the tailcall.

Note that having ESP bounce around requires extra metadata in .eh_frame for stack unwinding. Jan Hubicka writes in 2014:

There are still pros and cons of arg accumulation. I did quite extensive testing on AMD chips and found it performance neutral. On 32bit code it saves about 4% of code but with frame pointer disabled it expands unwind info quite a lot, so resulting binary is about 8% bigger. (This is also current default for -Os)

So that's a 4% code-size saving (in bytes; this matters for L1i cache footprint) from using push for args and, at least typically, clearing them off the stack after each call. I think there's a happy medium here, where gcc could use more push without using just push/pop.

There's a confounding effect of maintaining 16-byte stack alignment before call, which is required by the current version of the i386 System V ABI. In 32-bit mode, it used to just be a gcc default to maintain -mpreferred-stack-boundary=4. (i.e. 1<<4). I think you can still use -mpreferred-stack-boundary=2 to violate the ABI and make code that only cares about 4B alignment for ESP.

I didn't try this on Godbolt, but you could.
