
I'm writing a cryptography program, and the core (a wide multiply routine) is written in x86-64 assembly, both for speed and because it extensively uses instructions like adc that are not easily accessible from C. I don't want to inline this function, because it's big and it's called several times in the inner loop.

Ideally I would also like to define a custom calling convention for this function, because internally it uses all the registers (except rsp), doesn't clobber its arguments, and returns in registers. Right now, it's adapted to the C calling convention, but of course this makes it slower (by about 10%).

To avoid this, I can call it with asm("call %Pn" : ... : my_function... : "cc", all the registers); but is there a way to tell GCC that the call instruction messes with the stack? Otherwise GCC will just put all those registers in the red zone, and the top one will get clobbered. I can compile the whole module with -mno-red-zone, but I'd prefer a way to tell GCC that, say, the top 8 bytes of the red zone will be clobbered so that it won't put anything there.

fuz
Mike Hamburg
  • Just an untested thought, but can't you just specify an extra dummy input, such that GCC puts it in the red zone and it gets (harmlessly) clobbered? – Tony Delroy Jun 17 '11 at 05:55
  • Hm. Probably not reliably. I've found that it's pretty hard to control what GCC spills to the stack, when and where. In other crypto stuff I've written, I've tried with mixed success to suppress GCC's tendency to write, e.g., entire key tables to the stack for little reason. – Mike Hamburg Jun 18 '11 at 07:49
  • Add `sp` as a clobber? Add a memory clobber? – tc. Mar 22 '13 at 16:55
    How about defining the crypto routine as a macro (using top level asm at the top of the file)? Then invoking it (as opposed to `call`ing it) from several places within your C code via extended asm is slightly less horrible (although it does bloat the executable). You can still clobber all the registers, but the stack is unaffected. BTW, how does the crypto know what to crypt? Accessing globals via inline can be tricky. Also, clobbering sp has [no effect](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52813). – David Wohlferd Jun 27 '16 at 01:41

5 Answers


From your original question I did not realize gcc limited red-zone use to leaf functions. I don't think that's required by the x86_64 ABI, but it is a reasonable simplifying assumption for a compiler. In that case you only need to make the function calling your assembly routine a non-leaf for purposes of compilation:

int global;
void other(void);

void was_leaf(void)
{
    if (global) other();
}

GCC can't tell whether global will be true, so it can't optimize away the call to other(), which means was_leaf() is no longer a leaf function. I compiled this (with more code that triggered stack usage) and observed that as a leaf it did not move %rsp, and with the modification shown it did.

I also tried simply allocating more than 128 bytes (just char buf[150]) in a leaf but I was shocked to see it only did a partial subtraction:

    pushq   %rbp
    movq    %rsp, %rbp
    subq    $40, %rsp
    movb    $7, -155(%rbp)

If I put the leaf-defeating code back in, that becomes `subq $160, %rsp`.

Ben Jackson
  • There's `__attribute__((leaf))` but unfortunately there's nothing like `__attribute__((nonleaf))` – Ben Voigt Jun 18 '11 at 20:57
  • I don't find it shocking that gcc doesn't give up on the red-zone when it has to reserve some stack space: one of the benefits of the red-zone is being able to reach more memory with disp8 displacements, so having `rsp` in the middle of the locals means it can reach all of them with `[rsp-128..+127]` addressing modes. It's a good optimization. (Or it would have been if you'd used `-O3` + `volatile char buf[150]` to get an RSP-relative addressing mode instead of `-155(%rbp)`) – Peter Cordes Feb 20 '18 at 00:31

The max-performance way might be to write the whole inner loop in asm (including the call instructions, if it's really worth it to unroll but not inline. Certainly plausible if fully inlining is causing too many uop-cache misses elsewhere).

Anyway, have C call an asm function containing your optimized loop.

BTW, clobbering all the registers makes it hard for gcc to make a very good loop, so you might well come out ahead from optimizing the whole loop yourself. (e.g. maybe keep a pointer in a register, and an end-pointer in memory, because cmp mem,reg is still fairly efficient).

Have a look at the code gcc/clang wrap around an asm statement that modifies an array element (on Godbolt):

void testloop(long *p, long count) {
  for (long i = 0 ; i < count ; i++) {
    asm("  #    XXX  asm operand in %0"
    : "+r" (p[i])
    :
    : // "rax",
     "rbx", "rcx", "rdx", "rdi", "rsi", "rbp",
      "r8", "r9", "r10", "r11", "r12","r13","r14","r15"
    );
  }
}

#gcc7.2 -O3 -march=haswell

    push registers and other function-intro stuff
    lea     rcx, [rdi+rsi*8]      ; end-pointer
    mov     rax, rdi
   
    mov     QWORD PTR [rsp-8], rcx    ; store the end-pointer
    mov     QWORD PTR [rsp-16], rdi   ; and the start-pointer

.L6:
    # rax holds the current-position pointer on loop entry
    # also stored in [rsp-16]
    mov     rdx, QWORD PTR [rax]
    mov     rax, rdx                 # looks like a missed optimization vs. mov rax, [rax], because the asm clobbers rdx

         XXX  asm operand in rax

    mov     rbx, QWORD PTR [rsp-16]   # reload the pointer
    mov     QWORD PTR [rbx], rax
    mov     rax, rbx            # another weird missed-optimization (lea rax, [rbx+8])
    add     rax, 8
    mov     QWORD PTR [rsp-16], rax
    cmp     QWORD PTR [rsp-8], rax
    jne     .L6

  # cleanup omitted.

Clang counts a separate counter down towards zero, but it uses load / add -1 / store instead of a memory-destination `add [mem], -1` / `jnz`.

You can probably do better than this if you write the whole loop yourself in asm instead of leaving that part of your hot loop to the compiler.

Consider using some XMM registers for integer arithmetic to reduce register pressure on the integer registers, if possible. On Intel CPUs, moving between GP and XMM registers only costs 1 ALU uop with 1c latency. (It's still 1 uop on AMD, but higher latency especially on Bulldozer-family). Doing scalar integer stuff in XMM registers is not much worse, and could be worth it if total uop throughput is your bottleneck, or it saves more spill/reloads than it costs.

But of course XMM is not very viable for loop counters (paddd/pcmpeq/pmovmskb/cmp/jcc or psubd/ptest/jcc are not great compared to sub [mem], 1 / jcc), or for pointers, or for extended-precision arithmetic (manually doing carry-out with a compare and carry-in with another paddq sucks even in 32-bit mode where 64-bit integer regs aren't available). It's usually better to spill/reload to memory instead of XMM registers, if you're not bottlenecked on load/store uops.


If you also need calls to the function from outside the loop (cleanup or something), write a wrapper or use add $-128, %rsp ; call ; sub $-128, %rsp to preserve the red-zone in those versions. (Note that -128 is encodeable as an imm8 but +128 isn't.)

Including an actual function call in your C function doesn't necessarily make it safe to assume the red-zone is unused, though. Any spill/reload between (compiler-visible) function calls could use the red-zone, so clobbering all the registers in an asm statement is quite likely to trigger that behaviour.

// a non-leaf function that still uses the red-zone with gcc
void bar(void) {
  //cryptofunc(1);  // gcc/clang don't use the redzone after this (not future-proof)

  volatile int tmp = 1;
  (void)tmp;
  cryptofunc(1);  // but gcc will use the redzone before a tailcall
}

# gcc7.2 -O3 output
    mov     edi, 1
    mov     DWORD PTR [rsp-12], 1
    mov     eax, DWORD PTR [rsp-12]
    jmp     cryptofunc(long)

If you want to depend on compiler-specific behaviour, you could call (with regular C) a non-inline function before the hot loop. With current gcc / clang, that will make them reserve enough stack space since they have to adjust the stack anyway (to align rsp before a call). This is not future-proof at all, but should happen to work.


GNU C has an __attribute__((target("options"))) x86 function attribute, but it's not usable for arbitrary options, and -mno-red-zone is not one of the ones you can toggle on a per-function basis, or with #pragma GCC target ("options") within a compilation unit.

You can use stuff like

__attribute__(( target("sse4.1,arch=core2") ))
void penryn_version(void) {
  ...
}

but not __attribute__(( target("mno-red-zone") )).

There's a #pragma GCC optimize and an optimize function attribute (both of which are not intended for production code), but #pragma GCC optimize ("-mno-red-zone") doesn't work either. I think the idea is to let some important functions be optimized with -O2 even in debug builds. You can set -f options or -O levels, but not -m target options.

You could put the function in a file by itself and compile that compilation unit with -mno-red-zone, though. (And hopefully LTO will not break anything...)

Peter Cordes

Can't you just modify your assembly function to meet the requirements the x86-64 ABI places on signal handlers (i.e. treat the red zone as off-limits) by shifting the stack pointer down 128 bytes on entry to your function?

Or, if you are referring to the return pointer itself, put the shift into your call macro (so `sub $128, %rsp; call ...`).

Ben Jackson
    I can't do it from within the function itself, because `call` uses the stack and therefore breaks things by itself. `sub $128, %rsp; call...; add $128, %rsp` works, but it's less than ideal. I guess on balance it's probably best just to make my function meet the ABI. – Mike Hamburg Jun 18 '11 at 07:44

What about creating a dummy function that is written in C and does nothing but call the inline assembly?

Demi
  • And marking that function as `__attribute__((noinline))`? With `-O0`, the compiler might still spill function args to the red-zone. – Peter Cordes Dec 21 '17 at 15:23

I'm not sure, but looking at the GCC documentation for function attributes, I found the stdcall function attribute, which might be of interest.

I'm still wondering what you find problematic with your asm call version. If it's just aesthetics, you could transform it into a macro, or an inline function.

Mathias Brossard
    The `call` instruction pushes the current instruction pointer to the stack. This is fine if there's nothing under the stack (in the "red zone"), but on x86-64, the ABI allows the compiler to put stuff there in leaf functions, i.e. those that don't call anything. However, GCC doesn't see this `call` as a call, because it's hidden in inline asm. So it might put something in the red zone, and it'll get clobbered by the call. This isn't just a theoretical possibility, it actually happens and actually caused a bug in my code. Also, stdcall doesn't do that. – Mike Hamburg Jun 18 '11 at 07:38
  • Specifically the problem with stdcall is that it only works on actual, non-inlined functions. But to define a custom calling convention for my function, I'm trying to call it via inline asm. So GCC doesn't realize it's a function call at all (which is the problem in the first place), and thus I can't attach attributes to it. – Mike Hamburg Jun 18 '11 at 07:53