4

I've seen this r10 weirdness a few times, so let's see if anyone knows what's up.

Take this simple function:

#define SZ 4

void sink(uint64_t *p);

void andpop(const uint64_t* a) {
    uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }

    sink(result);
}

It just adds 1 to each of the 4 64-bit elements of the passed-in array and stores it in a local and calls sink() on the result (to avoid the whole function being optimized away).

Here's the corresponding assembly:

andpop(unsigned long const*):
        lea     r10, [rsp+8]
        and     rsp, -32
        push    QWORD PTR [r10-8]
        push    rbp
        mov     rbp, rsp
        push    r10
        sub     rsp, 40
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vpaddq  ymm0, ymm0, YMMWORD PTR [rdi]
        lea     rdi, [rbp-48]
        vmovdqa YMMWORD PTR [rbp-48], ymm0
        vzeroupper
        call    sink(unsigned long*)
        add     rsp, 40
        pop     r10
        pop     rbp
        lea     rsp, [r10-8]
        ret

It's hard to understand almost everything that is going on with r10. First, r10 is set to point to rsp + 8, then push QWORD PTR [r10-8], which as far as I can tell pushes a copy of the return address on the stack. Following that, rbp is set up as normal and then finally r10 itself is pushed.

To unwind all this, r10 is popped off of the stack and used to restore rsp to its original value.

Some observations:

  • Looking at the entire function, all of this seems like a totally roundabout way of simply restoring rsp to it's original value before ret - but the usual epilog of mov rsp, rpb would do just as well (see clang)!
  • That said, the (expensive) push QWORD PTR [r10-8] doesn't even help in that mission: this value (the return address?) is apparently never used.
  • Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body and there is no register pressure.

What's up with that? I've seen it several times before, and it usually wants to use r10, sometimes r13. It seems likely that has something to do with aligning the stack to 32 bytes, since if you change SZ to be less than 4 it uses xmm ops and the issue disappears.

Here's SZ == 2 for example:

andpop(unsigned long const*):
        sub     rsp, 24
        vmovdqa xmm0, XMMWORD PTR .LC0[rip]
        vpaddq  xmm0, xmm0, XMMWORD PTR [rdi]
        mov     rdi, rsp
        vmovaps XMMWORD PTR [rsp], xmm0
        call    sink(unsigned long*)
        add     rsp, 24
        ret

Much nicer!

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • I almost thought it was some kind of stack smashing protection, since it effectively stores an adjusted pointer to the old value of `rsp` onto the stack, and smashing the stack would clobber this value, such that the stack wouldn't be restored correctly and a maliciously overwritten return value wouldn't work - but (a) gcc can easily prove there is no smashing here and (b) it just changes the values you have to write but doesn't prevent any attack. – BeeOnRope Jul 31 '17 at 19:01
  • `sink` does something much worse than just prevent the function from being optimized away, it forces `result` to be in memory. – harold Jul 31 '17 at 19:10
  • @harold - sure, yes. I want to do the function to _do_ something, after all! In this case, it prevents the function from being optimized away by requiring the that the buffer actually be written (keeping it close to the C++) code. Here's [a version](https://godbolt.org/g/enf33U) that gets rid of `sink()` and just does `return result[0]`. No more writes to the stack _at all_ but the same weirdness with `r10`! – BeeOnRope Jul 31 '17 at 19:15
  • Related: [Why is gcc generating an extra return address?](https://stackoverflow.com/q/38781118) – Peter Cordes Jun 07 '20 at 09:09

1 Answers1

2

Well, you answered your question: The stack pointer needs to be aligned to 32 bytes before it can be accessed with aligned AVX2 loads and stores, but the ABI only provides 16 byte alignment. Since the compiler cannot know how much the alignment is off, it has to save the stack pointer in a scratch register and restore it afterwards. But the saved value has to outlive the function call, so it has to be put on the stack, and a stack frame has to be created.

Some x86-64 ABIs have a red zone (a region of the stack below the stack pointer which is not used by signal handlers), so it is feasible not to change the stack pointer at all for such short functions, but GCC apparently does not implement this optimization and it would not apply here anyway because of the function call at the end.

In addition, the default stack alignment implementation is rather poor. For this case, -maccumulate-outgoing-args results in better-looking code with GCC 6, just aligning RSP after saving RBP, instead of copying the return address before saving RBP:

andpop:
        pushq   %rbp
        movq    %rsp, %rbp            # make a traditional stack frame
        andq    $-32, %rsp            # reserve 0 or 16 bytes
        subq    $32, %rsp

        vmovdqu (%rdi), %xmm0         # split unaligned load from tune=generic
        vinserti128     $0x1, 16(%rdi), %ymm0, %ymm0   # use -march=haswell instead
        movq    %rsp, %rdi
        vpaddq  .LC0(%rip), %ymm0, %ymm0
        vmovdqa %ymm0, (%rsp)

        vzeroupper
        call    sink@PLT
        leave
        ret

(editor's note: gcc8 and later make asm like this by default (Godbolt compiler explorer with gcc8, clang7, ICC19, and MSVC), even without -maccumulate-outgoing-args)


This issue (GCC generating poor code for stack alignment) recently came up when we had to implement a workaround for GCC __tls_get_addr ABI bug, and we ended up writing the stack realignment by hand.

EDIT There is also another issue, related to RTL pass ordering: stack alignment is picked before the final determination whether the stack is actually needed, as BeeOnRope's second example shows.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Florian Weimer
  • 32,022
  • 3
  • 48
  • 92
  • 1
    Yes, gcc wants to align the stack and it has "something to do with that", but it doesn't explain all the craziness going on (see the bullet points). Note that `clang` aligns the stack in the same way but has none of this madness. – BeeOnRope Jul 31 '17 at 19:11
  • 1
    Well, a shorter way of putting it is that `-mno-accumulate-outgoing-args` is buggy as far as stack realignment goes, and enabled by default. The `-maccumulate-outgoing-args` case looks pretty decent to me, considering the constraints. – Florian Weimer Jul 31 '17 at 19:16
  • 1
    Thanks. FWIW, it's the not the function call at the end that triggers the issue. [Here's a version](https://godbolt.org/g/enf33U) which doesn't write to the stack _at all_ and has no function call at all, and has the same weirdness. – BeeOnRope Jul 31 '17 at 19:18
  • Ugh, okay, that certainly *is* buggy, and it happens with `-maccumulate-outgoing-args` as well. The issue seems to be that `result` still lives in memory after the GIMPLE optimization, and is eliminated only by one of the RTL passes (DSE?), apparently after the frame alignment has already been determined. – Florian Weimer Jul 31 '17 at 19:32
  • I guess the other obvious question is why doesn't `gcc` just use `rbp` these cases to preserve the original `rsp`, which is already ready-made for that purpose and here is set up anyway? – BeeOnRope Aug 01 '17 at 00:46
  • Sorry, I do not understand. `rbp` itself is callee-saved, so it needs to be preserved too. – Florian Weimer Aug 02 '17 at 17:54
  • Right, `rbp` is pushed onto the stack, and then `rsp` is moved into `rbp` to set up a classic stack frame. So you already know the original value of `rsp` and can use `mov rsp, rbp` in the epilogue to restore it (exactly like `clang` does). So the whole action of saving `rsp` in `r10` is somewhat pointless because `rbp` serves the same function. Its like `gcc` decides independently to set up a normal stack frame (`push rbp; mov rbp, rsp` and friends) and _also_ independently to save `rsp` in `r10`, but it only needs one of those. – BeeOnRope Aug 02 '17 at 18:09
  • 1
    Oh, right. Curiously, GCC generates much better code for the similar `-mstackrealign -msse2` scenario on i386 (where plain old SSE2 exceeds the original ABI alignment, too). – Florian Weimer Aug 02 '17 at 18:11