9

I encountered an unexpectedly early stack overflow and created the following program to test the issue:

#![feature(asm)]
#[inline(never)]
fn get_rsp() -> usize {
    let rsp: usize;
    unsafe {
        asm! {
            "mov {}, rsp",
            out(reg) rsp
        }
    }
    rsp
}

fn useless_function(x: usize) {
    if x > 0 {
        println!("{:x}", get_rsp());
        useless_function(x - 1);
    }
}

fn main() {
    useless_function(10);
}

This is get_rsp disassembled (according to cargo-asm):

tests::get_rsp:
 push    rax
 #APP
 mov     rax, rsp
 #NO_APP
 pop     rcx
 ret

I'm not sure what #APP and #NO_APP do or why rax is pushed and then popped into rcx, but it seems the function does return the stack pointer.

I was surprised to find that in debug mode, the difference between two consecutively printed rsp was 192(!) and even in release mode it was 128. As far as I understand, all that needs to be stored for each call to useless_function is one usize and a return address, so I'd expect every stack frame to be around 16 bytes large.

I'm running this with rustc 1.46.0 on a 64-bit Windows machine.

Are my results consistent across machine? How is this explained?


It seems that the use of println! has a pretty significant effect. In an attempt to avoid that, I changed the program (Thanks to @Shepmaster for the idea) to store the values in a static array:

static mut RSPS: [usize; 10] = [0; 10];

#[inline(never)]
fn useless_function(x: usize) {
    unsafe { RSPS[x] = get_rsp() };
    if x == 0 {
        return;
    }
    useless_function(x - 1);
}

fn main() {
    useless_function(9);
    println!("{:?}", unsafe { RSPS });
}

The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better? This seems a little inefficient.

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
chuck
  • 1,420
  • 4
  • 19
  • 1
    `#APP` / `#NO_APP` => Comments to delineate where **app**lication assembly starts and ends. – Shepmaster Sep 22 '20 at 19:11
  • 2
    `#APP` is standard GAS stuff that GCC (and apparently LLVM) puts around inline asm text. Historically, GAS had a fast parsing mode for compiler-generated asm that assumed it followed the fixed layout that compiler output used, like maybe assuming numbers always in decimal, no asm macros, etc. The dummy push/pop is for stack alignment, slightly more efficient than sub/add rsp, 8. [Why does this function push RAX to the stack as the first operation?](https://stackoverflow.com/q/37773787) – Peter Cordes Sep 22 '20 at 19:14
  • Have you researched the [standard calling conventions](https://en.wikipedia.org/wiki/X86_calling_conventions#List_of_x86_calling_conventions) for your CPU and OS? The most relevant parts are the one about caller-saved registers. – Stephen M. Webb Sep 22 '20 at 19:16
  • On Godbolt, we see a stack frame size of 0x60 bytes. https://godbolt.org/z/r9Wrvv. But it's not showing the asm output (unless you do "compile to binary"). I barely know Rust so IDK what's happening there. Also, IDK why it's not optimizing that tail-call into a loop. – Peter Cordes Sep 22 '20 at 19:20
  • On x86, each stack frame at least contains a copy of the frame pointer and a return address. In addition `x` is passed on the stack, plus potentially caller-saved registers, plus the temp data needed by the format machinery. – Sven Marnach Sep 22 '20 at 19:22
  • @SvenMarnach: `x` is passed in a *register* in x86-64 calling conventions. But without inlining functions, it does have to use stack space to save `x` across the call to `get_rsp`. It saves RBX, then copies `x` to RBX. Check the asm on https://godbolt.org/z/YE84vh (compiling for Linux, not Windows like the OP is doing, but the Windows x64 calling convention(s) pass args in registers, too). So yes, 8 byte return address, 8 byte saved RBX, then `sub rsp, 0x50` to reserve space for locals. And it fails to optimize the tailcall into a loop. – Peter Cordes Sep 22 '20 at 19:25
  • Your question appears to have changed to be "why is unoptimized code bigger than optimized code"... the answer to that is that it's unoptimized. – Shepmaster Sep 22 '20 at 19:59
  • @Shepmaster I'm not surprised that unoptimised code uses more stack space, but I'm still surprised that it uses that much. I just couldn't get the compiler to emit an optimised version which doesn't optimise the recursion away in the second example, because the example is so small. The question is still about stack usage in general. – chuck Sep 22 '20 at 20:02
  • @Shepmaster: I wouldn't go that far. There's a reasonable question about why exactly it's 80 bytes, when it seems reasonable that it might only need 32 (shadow space size) + 16 (return address and stack alignment) on Windows, if it's not optimizing the tailcall into a loop. But I guess we end up with local variables consisting of multiple pointers and they all need stack space. (And/or the compiler might just straight up waste some space because you told it not to optimize, just compile fast.) – Peter Cordes Sep 22 '20 at 20:10
  • 1
    @PeterCordes so the question is "why does this function take 80 bytes of stack space when compiled in debug mode instead of 48 bytes"? – Shepmaster Sep 22 '20 at 20:13
  • @Shepmaster: Yes. That's what I'd expect in C for storing to a global array in a tailcall "loop", even in debug mode, in the Windows x64 calling convention. – Peter Cordes Sep 22 '20 at 20:15
  • @PeterCordes shall I rewrite the question to clearly state that and then delete my answer, as it doesn't answer the question that you (and presumably OP) think is being asked? – Shepmaster Sep 22 '20 at 20:17
  • @Shepmaster: Sorry, I only meant that was the "bonus question" the OP tacked on in an edit. Your answer does cover a lot of why it's large in general with printing as part of the recursive function. If an edit changes the question in ways that invalidate existing answers, the edit is the problem no the answer. (An edit that adds a related more specific question is less problematic but not great either, but not something that should result in deleting an answer. Still, someone else might want to write another answer that comes at it from a different angle.) – Peter Cordes Sep 22 '20 at 20:18
  • 2
    @Shepmaster: I tried GCC and clang on an un-optimized C version using the Windows calling convention: https://godbolt.org/z/vMoWsG. As we can see, clang/LLVM un-optimized does `push rbp` / `sub rsp, 48`, so a total of 64 bytes per frame (including the return address). GCC does push / `sub rsp,32`, for a total of only 48 bytes per frame, as predicted. So apparently un-optimized LLVM does sometimes allocate unneeded space because it fails to use the shadow space allocated by the caller. – Peter Cordes Sep 22 '20 at 20:25
  • I run this modified code and it shows that frames consist only 48 bytes. https://play.rust-lang.org/?version=nightly&mode=debug&edition=2018&gist=5be0d9c5033ab93a6e67cd02cb3b2153 – Angelicos Phosphoros Sep 29 '20 at 20:39

2 Answers2

12

Using formatting machinery like println! creates a number of things on the stack. Expanding the macros used in your code:

fn useless_function(x: usize) {
    if x > 0 {
        {
            ::std::io::_print(::core::fmt::Arguments::new_v1(
                &["", "\n"],
                &match (&get_rsp(),) {
                    (arg0,) => [::core::fmt::ArgumentV1::new(
                        arg0,
                        ::core::fmt::LowerHex::fmt,
                    )],
                },
            ));
        };
        useless_function(x - 1);
    }
}

I believe that those structs consume the majority of the space. As an attempt to prove that, I printed the size of the value created by format_args, which is used by println!:

let sz = std::mem::size_of_val(&format_args!("{:x}", get_rsp()));
println!("{}", sz);

This shows that it is 48 bytes.

See also:


Something like this should remove the printing from the equation, but the compiler / optimizer ignores the inline(never) hint here and inlines it anyway, resulting in the sequential values all being the same.

/// SAFETY:
/// The length of `rsp` and the value of `x` must always match
#[inline(never)]
unsafe fn useless_function(x: usize, rsp: &mut [usize]) {
    if x > 0 {
        *rsp.get_unchecked_mut(0) = get_rsp();
        useless_function(x - 1, rsp.get_unchecked_mut(1..));
    }
}

fn main() {
    unsafe {
        let mut rsp = [0; 10];
        useless_function(rsp.len(), &mut rsp);

        for w in rsp.windows(2) {
            println!("{}", w[0] - w[1]);
        }
    }
}

That said, you can make the function public and look at its assembly anyway (lightly cleaned):

playground::useless_function:
    pushq   %r15
    pushq   %r14
    pushq   %rbx
    testq   %rdi, %rdi
    je  .LBB6_3
    movq    %rsi, %r14
    movq    %rdi, %r15
    xorl    %ebx, %ebx

.LBB6_2:
    callq   playground::get_rsp
    movq    %rax, (%r14,%rbx,8)
    addq    $1, %rbx
    cmpq    %rbx, %r15
    jne .LBB6_2

.LBB6_3:
    popq    %rbx
    popq    %r14
    popq    %r15
    retq

but in debug mode each frame still takes 80 bytes

Compare the unoptimized assembly:

playground::useless_function:
    subq    $104, %rsp
    movq    %rdi, 80(%rsp)
    movq    %rsi, 88(%rsp)
    movq    %rdx, 96(%rsp)
    cmpq    $0, %rdi
    movq    %rdi, 56(%rsp)                  # 8-byte Spill
    movq    %rsi, 48(%rsp)                  # 8-byte Spill
    movq    %rdx, 40(%rsp)                  # 8-byte Spill
    ja  .LBB44_2
    jmp .LBB44_8

.LBB44_2:
    callq   playground::get_rsp
    movq    %rax, 32(%rsp)                  # 8-byte Spill
    xorl    %eax, %eax
    movl    %eax, %edx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 24(%rsp)                  # 8-byte Spill
    movq    24(%rsp), %rax                  # 8-byte Reload
    movq    32(%rsp), %rcx                  # 8-byte Reload
    movq    %rcx, (%rax)
    movq    56(%rsp), %rdx                  # 8-byte Reload
    subq    $1, %rdx
    setb    %sil
    testb   $1, %sil
    movq    %rdx, 16(%rsp)                  # 8-byte Spill
    jne .LBB44_9
    movq    $1, 72(%rsp)
    movq    72(%rsp), %rdx
    movq    48(%rsp), %rdi                  # 8-byte Reload
    movq    40(%rsp), %rsi                  # 8-byte Reload
    callq   core::slice::<impl [T]>::get_unchecked_mut
    movq    %rax, 8(%rsp)                   # 8-byte Spill
    movq    %rdx, (%rsp)                    # 8-byte Spill
    movq    16(%rsp), %rdi                  # 8-byte Reload
    movq    8(%rsp), %rsi                   # 8-byte Reload
    movq    (%rsp), %rdx                    # 8-byte Reload
    callq   playground::useless_function
    jmp .LBB44_8

.LBB44_8:
    addq    $104, %rsp
    retq

.LBB44_9:
    leaq    str.0(%rip), %rdi
    leaq    .L__unnamed_7(%rip), %rdx
    movq    core::panicking::panic@GOTPCREL(%rip), %rax
    movl    $33, %esi
    callq   *%rax
    ud2
Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
  • I think that change let it turn the tail-recursion into a loop, otherwise RSP would change by at least `0x10` (return address plus re-aligning the stack before another call). – Peter Cordes Sep 22 '20 at 19:28
  • Thanks for the answer! It does make sense that the `println!` causes some overhead, but I disassembled the result and @PeterCordes is correct - the compiler removed `useless_function` completely, and just called `get_rsp` directly from `main` without recursion. – chuck Sep 22 '20 at 19:29
  • @PeterCordes yep, just addressing that. – Shepmaster Sep 22 '20 at 19:31
  • Thanks for the detailed answer. I still don't understand *why* the program requires so much stack space, it just seems unnecessary, but I guess that's the way it is... – chuck Sep 22 '20 at 19:59
3

This answer shows how this works in asm for an un-optimized C++ version.

This might not tell us as much as I thought about Rust; apparently Rust uses its own ABI / calling convention so it won't have "shadow space" making its stack frames bulkier on Windows. The first version of my answer guessed that it would follow the Windows calling convention for calls to other Rust functions, when targeting Windows. I've adjusted the wording, but I didn't delete it even though it's potentially not relevant to Rust.

After further research, at least in 2016 Rust's ABI happens to match the platform calling convention on Windows x64, at least if disassembly of the debug-build binary in this random tutorial is representative of anything. heap::allocate::h80a36d45ddaa4ae3Lca in the disassembly clearly takes args in RCX and RDX, (spills and reloads them to the stack), then calls another function with those args. Leaving 0x20 bytes of space unused above RSP before the call, i.e. shadow space.

If nothing has changed since 2016 (easily possible), I think this answer does reflect some of what Rust does when compiling for Windows.


The recursion gets optimised away in release mode, but in debug mode each frame still takes 80 bytes which is way more than I anticipated. Is this just the way stack frames work on x86? Do other languages do better?

Yes, C and C++ do better: 48 or 64 bytes per stack frame on Windows, 32 on Linux.

The Windows x64 calling convention requires a caller to reserve 32 bytes of shadow space (basically unused stack-arg space above the return address) for use by the callee. But it looks like un-optimized clang builds may not take advantage of that shadow space, allocating extra space to spill local vars.

Also, the return address takes 8 bytes, and re-aligning the stack by 16 before another call takes another 8 bytes, so the minimum you can hope for is 48 bytes on Windows (unless you enable optimization, then as you say, tail-recursion easily optimizes into a loop). GCC compiling a C or C++ version of that code does achieve that.

Compiling for Linux, or any other x86-64 target that uses the x86-64 System V ABI, gcc and clang manage 32 bytes per frame for a C or C++ version. Just ret addr, saved RBP, and another 16 bytes to keep alignment while making room to spill 8-byte x. (Compiling as C or as C++ makes no difference to the asm).


I tried GCC and clang on an un-optimized C++ version using the Windows calling convention on the Godbolt compiler explorer. To just look at the asm for useless_function, there was no need to write a main or get_rsp.

#include <stdlib.h>

#define MS_ABI __attribute__((ms_abi))   // for GNU C compilers.  Godbolt link has an ifdeffed version of this

void * RSPS[10] = {0};

MS_ABI void *get_rsp(void);
MS_ABI void useless_function(size_t x) {
    RSPS[x] = get_rsp();
    if (x == 0) {
        return;
    }
    useless_function(x - 1);
}

clang/LLVM un-optimized does push rbp / sub rsp, 48, so a total of 64 bytes per frame (including the return address). GCC does push / sub rsp,32, for a total of only 48 bytes per frame, as predicted.

So apparently un-optimized LLVM does allocate "unneeded" space because it fails to use the shadow space allocated by the caller. If Rust used shadow space, this might explains some of why your debug-mode Rust version might use more stack space than we might expect, even with printing done outside the recursive function. (Printing uses a lot of space for locals).

But part of that explanation must also include having some locals that take more space, e.g. perhaps for pointer locals or bounds checks? C and C++ map pretty directly to asm, with access to globals not needing any extra stack space. (Or even extra registers, when the global array can be assumed to be in the low 2GiB of virtual address space, so it's address is usable as a 32-bit signed displacement in combination with other registers.)

# clang 10.0.1 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp                  # set up a legacy frame pointer.
        sub     rsp, 48                   # reserve enough for shadow space (32) + 16, maintaining stack alignment.
        mov     qword ptr [rbp - 8], rcx   # spill incoming arg to newly reserved space above the shadow space
        call    get_rsp()
...

The only space for locals used on the stack is for x, no invented temporaries as part of array access. It's just a reload of x then mov qword ptr [8*rcx + RSPS], rax to store the function call return value.

# GCC10.2 -O0, for Windows x64
useless_function(unsigned long):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32                   # just reserve enough for shadow space for callee
        mov     QWORD PTR [rbp+16], rcx   # spill incoming arg to our own shadow space
        call    get_rsp()
...

Without the ms_abi attribute, both GCC and clang use sub rsp, 16.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • *is the Windows x64 calling convention* — why would the Rust code using the Rust ABI follow that calling convention? – Shepmaster Sep 22 '20 at 20:45
  • *if Rust arrays know their own size* — **array** sizes are part of the type and thus thinner. Contrast that to **slices** which are fat pointers of data + length. – Shepmaster Sep 22 '20 at 20:46
  • @Shepmaster: thanks for the corrections, added a caveat to my answer. I assumed Rust would use the system calling convention. The OP says they saw 128 bytes (0x80) between frames on Windows with an optimized build, but on Godbolt with Rust running the Linux executable built from the OP's original source, I only saw `0x60`. I guessed that that was explained by the Windows build leaving shadow space. (Which wasn't large enough for some single object used by printing). I don't know how to get my hands on asm output for Rust targeting Windows so I can't check. – Peter Cordes Sep 22 '20 at 21:08
  • @Shepmaster: I don't have a Windows Rust cross compiler, so I googled and found an existing Windows binary built from Rust source by a Rust compiler. (on some random tutorial which had source and binary download links: https://www.codeproject.com/Tips/1053658/Win-GUI-Programming-In-Rust-Language). It certainly looks like it *is* using the Windows x64 calling convention, including shadow space, even for calls to internal functions. (It's a debug build with symbols. `heap::allocate::h80a36d45ddaa4ae3Lca` is a simple wrapper). 2016 is forever ago for Rust, though, so IDK if that's relevant. – Peter Cordes Oct 02 '20 at 09:56
  • If you are energized to test it out, what I do is download a Windows VM from http://modern.ie and then install the Rust compiler (which instructs you how to install the relevant Visual Studio components) – Shepmaster Oct 02 '20 at 11:19
  • To be clear, I don’t know that you are wrong about the ABI, I just know that the Rust ABI isn’t fixed and functions can choose their calling convention (`extern “C”`, for example) – Shepmaster Oct 02 '20 at 11:22
  • @Shepmaster: I'm not that keen :/ The most I'd be interested in installing is a cross-compiler. Or I guess I could install something under Wine, or on one of the Windows machines my family members have. IIRC, the OP's test results (sizes of stack frames) would be explained by my conclusions about LLVM targeting Windows x64, extrapolating from what it does when targeting Linux, if Rust for Windows uses Windows x64. – Peter Cordes Oct 02 '20 at 11:23