RET x versus ADD RSP, x in x86-64 assembly

Question

I am writing a program in MASM64.
I use WinAPI a lot.
I don't use the push and pop instructions, I use mov [rsp + x] instead.
I don't use local variables.
I don't use prolog/epilog.
I don't use RBP at all.
I do use sub rsp, x to reserve shadow space and keep the stack 16-byte aligned.

What is the difference between ending a procedure with ret x vs add rsp, x? I understand they both add a value to RSP to clean up the stack. Is there any performance difference?
I guess ret x would be faster, since after add rsp, x there will be a ret anyway.

`ret x` pops the return address off the stack first, then adds `x` to RSP, then continues execution at the address popped off the stack. `ret x` is used in calling conventions where the callee (the function being called) cleans up the parameters passed by the caller. `add rsp, x` `ret` is not the same thing as `ret x`, since `add rsp, x` `ret` adjusts the stack first, then the return address is popped off the stack, and then control is transferred to that address. — Michael Petch, Feb 28 '23 at 13:34
Note: `ret 0` and `ret` effectively do the same thing since RSP is adjusted by 0 (doesn't adjust RSP). — Michael Petch, Feb 28 '23 at 13:45
Thanks Michael. Besides the order of execution, is there another difference? It sounds to me that the end result is the same so ret x will be faster. Am I wrong? — Danny Cohen, Mar 01 '23 at 03:27
You completely misunderstood what Michael wrote. The end result is not the same. — prl, Mar 01 '23 at 05:41
What don't I understand? Will the value of rsp be different? — Danny Cohen, Mar 01 '23 at 07:35
RSP will be the same (for now), but the value of RIP will be different because you pop a different stack slot as the return address! Returning to the correct place is important, so it's not a drop-in replacement. (Also, returning to the wrong place will cause a mispredict of the call/ret predictor, so even the ret itself will be slower.) — Peter Cordes, Mar 01 '23 at 07:39
Now I understand. So the problem is in the return address, not the rsp value. Thanks a lot Peter! — Danny Cohen, Mar 01 '23 at 08:23
*I don't use local variables.* - That's weird. You never run out of registers in your functions? If you use global variables instead, that's generally less efficient since the addressing mode to reach a global is longer than `[rsp+4]` or whatever. And IIRC, instructions like `mov dword ptr [foo], 1` can't micro-fuse the store-address + store-data uops in an instruction with a RIP-relative addressing mode and an immediate, on Intel CPUs. — Peter Cordes, Mar 03 '23 at 09:59
Interesting point @Peter, I never run out of registers since I write small modular procedures and my CPU supports ZMM so I have 32 512 bit registers. — Danny Cohen, Mar 03 '23 at 10:05
*I write small modular procedures* - So you have call/ret overhead, and you're probably pushing and popping call-preserved registers to the stack, so it's similar to having local variables in terms of loading/storing state to the stack, but with worse register allocation. One of the major benefits of compilers inlining small functions to make larger functions is avoiding that overhead. (It's not common to run out of registers in a loop or something on x86-64, though, for something worth hand-writing in asm. But stack space for scratch arrays can be useful.) — Peter Cordes, Mar 03 '23 at 10:16
Good points @Peter, I am still a novice X64 coder so I will take your points into consideration. — Danny Cohen, Mar 03 '23 at 10:19
Looking at compiler output (and thinking of ways you could do better) is often a good way to learn more about what's efficient. Not inlining a function containing a loop that takes significant time compared to call/ret overhead is usually fine, although if the caller passes constants as some of the args, and the function could simplify significantly for that case, then inlining would be a big win. That's another reason compilers do it. (And the maintenance nightmare of tweaking 5 versions of every function, each specialized for its call-site, is why we use compilers instead of hand-writing asm.) — Peter Cordes, Mar 03 '23 at 10:23
See [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) for more about doing that, and functions that compile to asm that's interesting to look at. — Peter Cordes, Mar 03 '23 at 10:25

score 4 · Accepted Answer · edited Mar 02 '23 at 17:52

Let's say on the stack we have 55, 44, 33, 22, 11 (top to bottom, so RSP points at 55) and each value is a 64-bit integer.

In this case, ret 0x20 would return to address 55 and then remove the other four values from the stack, while add rsp, 0x20; ret would first remove four values from the stack and then return to address 11.

In both cases, a total of five values gets removed from the stack, but the return address is different in the two cases.
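The same example can be traced instruction by instruction. This is only a sketch in MASM syntax (where 0x20 is written 20h): the labels ending_a and ending_b are hypothetical, and 55 through 11 stand in for real 64-bit values.

```asm
; Assumed stack contents when either ending starts (lowest address first):
;   [rsp+00h] = 55    <- RSP points here
;   [rsp+08h] = 44
;   [rsp+10h] = 33
;   [rsp+18h] = 22
;   [rsp+20h] = 11

.code

ending_a PROC               ; hypothetical label
    ret  20h                ; pops 55 into RIP (RSP += 8), then RSP += 20h
ending_a ENDP               ; execution continues at 55; 44, 33, 22, 11 are discarded

ending_b PROC               ; hypothetical label
    add  rsp, 20h           ; discards 55, 44, 33, 22; now [rsp] = 11
    ret                     ; pops 11 into RIP (RSP += 8)
ending_b ENDP               ; execution continues at 11

END
```

Either way RSP ends up 28h higher than it started. Only the slot that gets used as the return address differs.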

Since you reserve that space with sub rsp, x from within your function (shadow space and alignment padding in your case, but the same applies to local variables), you'll need the latter case (add rsp, x; ret). The other case is for removing function arguments which the caller pushed before calling the function.
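For the setup described in the question, a procedure could look like the sketch below. It assumes ml64 and the Windows x64 calling convention. The procedure name get_ticks is made up for illustration, and GetTickCount is just a convenient WinAPI function that takes no arguments.

```asm
EXTERN GetTickCount:PROC    ; imported from kernel32.lib

.code

get_ticks PROC              ; hypothetical procedure name
    sub  rsp, 28h           ; 20h of shadow space + 8 so RSP is 16-byte aligned at the call
    call GetTickCount       ; returns the tick count in EAX
    add  rsp, 28h           ; undo the sub: [rsp] is the return address again
    ret                     ; plain ret pops only the return address
get_ticks ENDP

END
```

Ending this procedure with ret 28h instead would pop whatever is in the lowest shadow-space slot as the return address, which is the wrong-RIP problem described in the comments above.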

Neither of the mainstream x86-64 calling conventions (Windows x64 and x86-64 System V) is callee-pops, so x86-64 code uses plain ret, leaving the stack args allocated. The caller can reuse that space for another call.

A few 32-bit conventions are callee-pops, like stdcall and fastcall, using ret n. i386 System V and cdecl are caller-pops conventions.
