Coming from C and C++, I have recently started to learn x86-64 assembly to better understand how my programs work.

I know that the convention in x64 assembly is to reserve 32 bytes of 'shadow store' on the stack before calling a function (by doing: subq $0x20, %rsp).

What I am unsure about is: is the callee responsible for incrementing %rsp again, or the caller?

In other words (using printf as an example), would number 1 or number 2 be correct (or perhaps neither :P)?

1.

subq $0x20, %rsp
movabsq $msg, %rcx
callq printf

2.

subq $0x20, %rsp
movabsq $msg, %rcx
callq printf
addq $0x20, %rsp

(... where msg is an ascii string stored in the .data section that I am passing to printf)

I am on Windows 10, using GAS as my assembler.

Any help would be much appreciated, cheers.

Peter Cordes
  • 1
    `printf` is a [varargs](https://en.wikipedia.org/wiki/Variadic_function) function. In the general case, it is legal in C to pass to a varargs function more parameters than it uses -- so it must be the caller that pops arguments as they're the only one that really knows how many parameters were passed and so how many to pop. I'm sure there are some nuances to this, though. It is possible to allocate shadow store in prologue -- it doesn't have to be done per call site. – Erik Eidt Jul 31 '22 at 23:26
  • @ErikEidt yes, thank you, that seems perfectly reasonable as you say because `printf` itself has no clue regarding the actual number of parameters you have passed to it. However, the point that is confounding me is this 'shadow store' of 32 bytes that have to be allocated on the stack before calling any function (to guarantee the function a minimum of 32 bytes it can use). Do I (the caller) free these bytes, or must the function do it? – Gregor Hartl Watters Jul 31 '22 at 23:31
  • 2
    The caller is expected to manage the shadow and provide it to the callee. The space is owned by the caller: to be deallocated by the caller. It would be minimal for caller to allocate shadow once in prologue and deallocate in epilogue rather than per each call site within the caller's function body. Allocation of all shadow space (and all parameter passing space) can be done by examination of the whole function prior to generation of prologue, epilogue and the code for the body. – Erik Eidt Aug 01 '22 at 00:05
  • 2
    Note also that you need to emit unwind data if your function moves the stack pointer or modifies any non-volatile registers. – Raymond Chen Aug 01 '22 at 02:14
  • @ErikEidt that is brilliant, you could not have been clearer. Thank you. – Gregor Hartl Watters Aug 01 '22 at 02:26
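
The point made in the comments above about variadic functions can be shown with a small C sketch. The function name `sum_first` is made up for illustration. It reads only the first `count` arguments, so the callee has no way to know how much the caller actually passed, which is why cleanup has to be the caller's job.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical example function: sums the first `count` int arguments
 * and never looks at anything passed after them. */
static int sum_first(int count, ...)
{
    va_list ap;
    int total = 0;

    assert(count >= 0);
    va_start(ap, count);
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* each variadic arg here is an int */
    va_end(ap);
    return total;
}
```

A call such as `sum_first(2, 10, 20, 30, 40)` passes four variadic arguments, but the function only consumes the first two.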

1 Answer

Deallocating shadow space is the caller's responsibility.

But normally you'd do it once per function, not once per call-site within a function. Usually you just move RSP once (maybe after some pushes) and leave it alone until you're ready to return. That single allocation also includes room for stack args, if any, when calling functions that take more than 4 args.
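
As a rough illustration of how that one allocation could be sized, here is a small C sketch. The helper `win64_frame_size` and its parameters are hypothetical, not part of any toolchain. It assumes a non-leaf function, that RSP is 8 mod 16 on entry (the caller's call just pushed a return address), and that RSP must be 16-byte aligned at each call site.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes to subtract from RSP once in the prologue.
 *   pushes      - number of 8-byte register pushes done before the sub
 *   local_bytes - space wanted for the function's own locals
 *   max_args    - largest argument count of any function it calls
 */
static size_t win64_frame_size(size_t pushes, size_t local_bytes, size_t max_args)
{
    size_t shadow = 32;  /* shadow space for the 4 register args */
    size_t stack_args = max_args > 4 ? (max_args - 4) * 8 : 0;
    size_t wanted = shadow + stack_args + local_bytes;

    /* Bytes already on the stack below the caller's aligned RSP:
     * the return address plus any pushes. */
    size_t used = 8 + pushes * 8;

    /* Round the total up to a multiple of 16, then take off what is
     * already there to get the sub amount. */
    size_t total = (used + wanted + 15) / 16 * 16;
    size_t size = total - used;

    assert((used + size) % 16 == 0);
    return size;
}
```

With no pushes, no locals, and callees taking at most 4 args, `win64_frame_size(0, 0, 2)` gives 40: 32 bytes of shadow space plus 8 bytes of padding, the same amount as the `subq $40, %rsp` in the compiler output shown in this answer.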

In the Windows x64 calling convention (and x86-64 System V), the callee must return without changing the caller's RSP. i.e. with ret, not ret 32, and without having copied the return address somewhere else.

MS has some examples in https://learn.microsoft.com/en-us/cpp/build/prolog-and-epilog?view=msvc-170#epilog-code
and specifically documents that RSP is nonvolatile, i.e. a function must return with it unchanged:

The x64 ABI considers registers RBX, RBP, RDI, RSI, RSP, R12, R13, R14, R15, and XMM6-XMM15 nonvolatile. They must be saved and restored by a function that uses them.

(You also need to emit unwind metadata for every instruction that moves the stack pointer, and about where you saved non-volatile aka call-preserved registers, if you want to be fully compliant with the ABI, including for SEH and C++ exception unwinding. Toy programs still work fine without, as long as you don't expect C++ exceptions to work, or debuggers to unwind the stack back to the stack frame of a caller.)


You can see this if you look at MSVC compiler output, e.g. https://godbolt.org/z/xh38jxWqT . For AT&T syntax, use gcc -O2 -mabi=ms, which tells GCC to treat all the functions it sees as __attribute__((ms_abi)) by default. That doesn't override the fact that it's targeting Linux, though: with -fPIE to make it use LEA instead of 32-bit absolute addressing for symbol addresses, we also get call printf@plt, not Windows-style calls to DLL functions.

But the stack management from GCC matches what MSVC -O2 also does.

#include <stdio.h>

void bar();
int foo(){
    printf("%d\n", 1);
    bar();
    return 1;  // make sure this isn't a tailcall
}

# gcc -O2 -mabi=ms  (but still sort of targeting Linux as far as dynamic linking)
.LC0:
        .string "%d\n"      ## in .rodata

foo():
        subq    $40, %rsp
        movl    $1, %edx
        movl    $.LC0, %ecx      # with -fPIE, uses    leaq    .LC0(%rip), %rcx  like you'd want for Windows x64
        call    printf
        call    bar()
        movl    $1, %eax
        addq    $40, %rsp
        ret

See also How to remove "noise" from GCC/clang assembly output? for more about looking at compiler output - you can answer most questions about how things normally work by looking at what compilers do in practice. Sometimes things compilers do are just a coincidence, especially with optimization disabled (which is why I constructed an example that couldn't inline the functions, so I could still see the calls with optimization enabled). But here we can rule out your alternate hypothesis.

I also constructed this example to show two calls using the same allocation of shadow space, not pointlessly deallocating / reallocating with add/sub. Even with optimization disabled, compilers don't do that.
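
To see the stack-args case in compiler output as well, a variation on the test source might look like the sketch below. The names `sum6`, `call_twice`, `self_check` and the `MS_ABI_NOINLINE` macro are made up for illustration. The attributes are only applied when building with GCC or clang for x86-64, and the macro expands to nothing elsewhere.

```c
#include <assert.h>

/* Use the Windows x64 convention and block inlining where the compiler
 * supports these attributes; otherwise fall back to the default ABI. */
#if defined(__GNUC__) && defined(__x86_64__)
#define MS_ABI_NOINLINE __attribute__((ms_abi, noinline))
#else
#define MS_ABI_NOINLINE
#endif

/* Six integer args: the first four go in RCX, RDX, R8, R9 and the last
 * two go on the stack, just above the 32 bytes of shadow space. */
MS_ABI_NOINLINE long sum6(long a, long b, long c, long d, long e, long f)
{
    return a + b + c + d + e + f;
}

/* Two calls in one function, so both can share one stack allocation. */
MS_ABI_NOINLINE long call_twice(long x)
{
    long first = sum6(x, 2, 3, 4, 5, 6);
    long second = sum6(first, 1, 1, 1, 1, 1);
    return second + 1;   /* not a tailcall */
}

/* Optional runtime check of the arithmetic, not needed on Godbolt. */
void self_check(void)
{
    assert(sum6(1, 2, 3, 4, 5, 6) == 21);
    assert(call_twice(1) == 27);
}
```

Built with gcc -O2, one would expect `call_twice` to adjust RSP once, by at least 32 bytes of shadow space plus 16 bytes for the fifth and sixth args plus any alignment padding, and to store those two args at 32(%rsp) and 40(%rsp) before each call. The exact instructions depend on the compiler and version, and a compiler may still specialize `sum6` for constant arguments even though it can't inline it.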

Re: putting symbol addresses into registers, see How to load address of function or label into register - RIP-relative LEA is the go-to option. It's position-independent, works in any executable or library with less than 2GiB of static code+data, and is more efficient than movabs.

Sep Roland
Peter Cordes
  • Your answer is brilliant in more than one way, and has clarified much more than what I had asked for. Thank you very much. I will look further into everything you've mentioned. – Gregor Hartl Watters Aug 01 '22 at 02:31