26

Say I'm writing a routine in x86 assembly, like, "add" which adds two numbers passed as arguments.

For the most part this is a very simple method:

push ebp
mov ebp, esp
mov eax, [ebp+8]
add eax, [ebp+12]
mov esp, ebp
pop ebp
ret

But, is there any way I could rewrite this method to avoid the use of the "ret" instruction and still have it produce the exact same result?

Govind Parmar
  • 3
    And... why would you want to write a routine without using RET? – mcleod_ideafix Nov 21 '13 at 18:38
  • 10
    @mcleod_ideafix: this routine certainly is convenient to write with RET. However, it is quite common in assembler to have various values on the stack when the function is entered; sometimes it is convenient to pick the return address out of the stack, clean up the stack with an lea instruction, and then jmp indirect. I do a *lot* of really high-performance assembler coding in a runtime system for a parallel programming language; this occurs more often than you'd expect. And if you are writing really pure vanilla assembly code, why are you writing assembly code at all? – Ira Baxter Nov 21 '13 at 22:11
  • 2
    You can do that with other instructions, but the performance will not be optimal, for the reason explained [here](http://blogs.msdn.com/b/oldnewthing/archive/2004/12/16/317157.aspx) – phuclv Apr 16 '15 at 17:26
  • 1
    [Does ret instruction cause esp register added by 4?](https://stackoverflow.com/q/4292447) is a better link for showing beginners exactly what `ret` does: it's how x86 spells `pop eip`. Also [Does it matter where the ret instruction is called in a procedure in x86 assembly](https://stackoverflow.com/q/46714626). The accepted answer here is over-complicated by keeping all registers unmodified (including ones that are call-clobbered in standard calling conventions), but mutates static storage instead. – Peter Cordes Mar 10 '21 at 06:29

5 Answers

31

Sure.

push ebp
mov ebp, esp
mov eax, [ebp+8]
add eax, [ebp+12]
mov esp, ebp
pop ebp

pop ecx  ; these two instructions simulate "ret"
jmp ecx

This assumes you have a free register (e.g., ecx). Writing an equivalent that uses "no registers" is possible (after all, the x86 without "ret" is still Turing-complete), but it is likely to involve a lot of convoluted register and stack shuffling.

Most current OSes offer thread-specific storage accessible by one of the segment registers. You could then simulate "ret" this way, safely:

 pop   gs:preallocated_tls_slot  ; pick one
 jmp   gs:preallocated_tls_slot
Ira Baxter
  • 2
    Something like `add esp,4 / jmp [esp-4]` maybe? By the way, shouldn't the jump in your version be `jmp ecx` ? – Michael Nov 21 '13 at 18:36
  • 5
    Yes, should have been "jmp ecx", fixed. Your "add esp,4/jmp [esp-4]" doesn't simulate what ret does; it is in fact completely unsafe. There is no guarantee that an interrupt won't happen between the "add esp,4" and the "jmp [esp-4]", which would trash your return address. – Ira Baxter Nov 21 '13 at 20:09
  • 1
    @Griwes: How does that "sti" instruction get executed? The jmp happens first. And, this doesn't prevent your friendly native OS from interrupting you. – Ira Baxter Nov 21 '13 at 22:08
  • Could you not pop the stack value directly into the instruction IP register? – Dean P Jan 15 '21 at 09:37
  • @DeanP: There isn't any "pop pc" instruction that I know about. There isn't any reason to have one: that's what "ret" does, which was the point of the question. – Ira Baxter Jan 15 '21 at 14:13
  • @IraBaxter, maybe you meant "after all the x86 without `ret` is a Turing machine" ? – InfiniteLooper Jan 13 '23 at 18:14
  • @InfiniteLooper: Yes, that's what I meant, otherwise you could just use the "ret" instruction which defeats the point. Frankly, you can take away most of the instruction set (in lots of ways) of the x86 and its still Turing capable. – Ira Baxter Jan 13 '23 at 18:56
8

This does not need a free register to simulate ret, but it does need 4 bytes of memory (a dword), and it uses an indirect jmp. Edit: as Ira Baxter noted, this code is not reentrant. It works fine in single-threaded code, but it can crash in multithreaded code, because every thread shares the same return_address slot.

push ebp
mov  ebp, esp
mov  eax, [ebp+8]
add  eax, [ebp+12]
mov  ebp, [ebp+4]
mov  [return_address], ebp
pop  ebp

add  esp,4
jmp  [return_address]

.data
return_address dd 0

The following replaces only the ret instruction, without changing the rest of the code. It is not reentrant either, so do not use it in multithreaded code. Edit: fixed a bug in the code below.

push ebp
mov  ebp, esp
mov  ebp, [ebp+4]
mov  [return_address], ebp
pop  ebp

add  esp,4
jmp  [return_address]

.data
return_address dd 0
nrz
  • 2
    That works, but it isn't reentrant and would be disastrous in any kind of multithreaded code. – Ira Baxter Nov 21 '13 at 22:07
  • @IraBaxter Thanks for the advice. Appropriate warning added. – nrz Nov 21 '13 at 22:14
  • Another caution: if you modify the instruction stream, the modification may not be seen by the CPU. The implementation-dependent instruction fetcher grabs a chunk of code periodically; it may grab code that you are about to modify, and then the modifications won't be seen, and a crash will result. So as written, this is pretty unsafe. There are instructions to force the instruction stream to be refetched, to handle this case. Frankly, using the registers is smaller, better, and faster. – Ira Baxter Feb 29 '16 at 23:22
  • 4
    @MurilloHenrique Because the return address is placed on a fixed position in memory. This is fixed for the entire program, so another thread may come in, run the routine, messing up execution for the other thread which may also happen to be in the routine. – Sebazzz Feb 19 '17 at 21:55
  • @Sebazzz so the solution would be to store the return address into a register? I think I get the point, thank you! – Murillo Ferreira Feb 20 '17 at 18:28
  • 3
    @MurilloHenrique You could also store the return address somewhere in thread local storage. – yyny Nov 21 '17 at 16:56
6

Some other answers present ideas for avoiding registers entirely. This is slower and usually not needed.

(Much slower if you don't have a red-zone below ESP/RSP you can use, as the x86-64 System V ABI guarantees for user-space. But no other x86/x86-64 ABIs guarantee a red-zone, so a debugger evaluating print some_func(123) while stopped at a breakpoint, or a Unix signal handler, could clobber space below ESP. See Is it valid to write below ESP? for more about the safety of data below ESP, especially on Windows.)


In typical 32-bit calling conventions, EAX, ECX, and EDX are all call-clobbered. (This covers i386 System V and all of the Windows conventions: cdecl, stdcall, fastcall, etc.)

The Irvine32 calling convention has no call-clobbered registers; that's the one case I know of where this won't work.

So unless you're using a custom calling convention that returns something in ECX, you can safely replace ret with pop ecx/jmp ecx and still produce "the exact same result" and fully obey the calling convention. (64-bit integers are returned in EDX:EAX, so in some functions you can't clobber EDX).

add:
    mov   eax, [esp+4]
    add   eax, [esp+8]
    ;;ret
    pop   ecx
    jmp   ecx           ; bad performance: misaligns the return address predictor stack

I also removed the stack-frame overhead / noise for readability.

ret is basically how you write pop eip (or IP / RIP) in x86, so popping into an architectural register and using a register-indirect jump is architecturally equivalent. (But much worse microarchitecturally because of call/ret special handling for branch prediction.)
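To make "architecturally equivalent" concrete, here is a minimal caller-side sketch. It assumes NASM syntax, 32-bit mode, and a cdecl-style routine where the caller pops the args. The labels caller and add_two are hypothetical names chosen for illustration (add is itself an instruction mnemonic, so some assemblers may not accept it as a plain label).

```nasm
; Hypothetical caller; add_two is an assumed label for the routine shown above.
caller:
    push  2               ; arg2
    push  1               ; arg1
    call  add_two         ; pushes the address of the next instruction, then jumps
    add   esp, 8          ; execution resumes here for both ret and pop ecx / jmp ecx
    ; EAX = 3 in both cases.
    ; Only difference visible to the caller: with pop ecx / jmp ecx,
    ; ECX now holds the return address (the address of the add esp, 8 above).
```

Because ECX is call-clobbered in the conventions listed above, a conforming caller never relies on its old value, so it cannot tell the two versions apart.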


To avoid registers, in a function with a stack arg, we can overwrite one of the args. In the standard calling conventions, functions own their incoming args and can use those arg-passing slots as scratch space, even if they're declared as foo(const int a, const int b).

add:
    mov   eax, [esp+4]    ; arg1
    add   eax, [esp+8]    ; arg2
    ;;ret
    pop   dword [esp]     ; copy return address to arg1, and do ESP+=4
    jmp   [esp]           ; ESP is pointing to arg1

This wouldn't work for a function with no args, or with only register args. (Except in Windows x64, where you could copy the retaddr into the 32-byte shadow space above the return address.)

Despite the pseudocode in the Operation section in Intel's ISA manual (https://www.felixcloutier.com/x86/pop) showing DEST ← SS:ESP; happens before ESP += 4, the Description section says "If the ESP register is used as a base register for addressing a destination operand in memory, the POP instruction computes the effective address of the operand after it increments the ESP register." Also that "POP ESP increments the stack pointer (ESP) before data at the old top of stack is written into the destination." So it's really tmp = pop ; dst = tmp. AMD doesn't mention either corner-case at all.

If I'd left in the legacy stack-frame crap with EBP, I could have avoided an [ESP] destination pop, using EBP as a temporary before restoring it. mov ebp, [ebp+4] / mov [esp+8], ebp / pop ebp / add esp,4 / jmp [esp], but that's hardly better or easier to follow. (The saved EBP value is below the return address, and you can't safely move ESP up past it either.) And this temporarily breaks legacy backtraces following a chain of EBP pointing to saved-EBP.

Or you could save / restore another register to use as a temporary for copying the return address over an arg. But that seems pointless vs. pop [esp] once you sort out exactly what that does.


Avoiding RET is terrible for performance

(Unless your caller also avoided call, manually pushing a return address.)
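A caller that avoids call might look roughly like the sketch below. It assumes NASM syntax and 32-bit mode; caller, .back, and add_two are hypothetical labels, with add_two standing for a routine that ends in pop ecx / jmp ecx.

```nasm
; Hypothetical caller that never executes call.
caller:
    push  2               ; arg2
    push  1               ; arg1
    push  dword .back     ; manually push the return address
    jmp   add_two         ; plain jump instead of call
.back:
    add   esp, 8          ; caller pops the two args (cdecl-style)
```

Neither side executes call or ret here, so no mismatched pair is introduced; the jmp ecx is still an ordinary indirect branch as far as prediction goes.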

Mismatched call/ret pairs lead to bad performance for future ret instructions going back up the call-stack in parent functions.

See Microbenchmarking Return Address Branch Prediction, and also Agner Fog's microarch and optimization guides. Specifically the part that's quoted and discussed in Return address prediction stack buffer vs stack-stored return address?

(Fun fact: most CPUs special-case call +0, because it's not rare for code to use call next_instruction / pop ebx as part of position-independent 32-bit code, to work around the lack of RIP-relative addressing. See the stuffedcow.net blog post.)

Note that a tailcall like jmp add instead of call add / ret is fine: that doesn't cause a mismatch because the first ret is returning to the most recent call (in the parent of the function that ended with a tailcall). You could look at it as making the body of the 2nd function "part of" the function that did the tailcall, as far as call / ret is concerned.
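A minimal sketch of that tailcall pattern, assuming NASM syntax and 32-bit mode. The labels wrapper and add_two are hypothetical, and the sketch assumes wrapper's own stack args are already laid out the way add_two expects.

```nasm
; Hypothetical wrapper that ends in a tailcall.
wrapper:
    ; ... any work that leaves ESP pointing at wrapper's return address ...
    jmp   add_two         ; tailcall: no new return address is pushed

add_two:
    mov   eax, [esp+4]    ; arg1, as passed by wrapper's caller
    add   eax, [esp+8]    ; arg2
    ret                   ; returns directly to wrapper's caller
```

The single ret pairs with the single call that wrapper's caller executed, so the call and ret counts stay matched.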

Peter Cordes
  • 1
    Related: re: pseudocode for `pop` that works even for `pop esp` and `pop dword [esp+eax]`, unlike Intel's manuals: [What is an assembly-level representation of pushl/popl %esp?](https://stackoverflow.com/a/69489798) – Peter Cordes Feb 01 '22 at 14:20
5

Haven't tested, but you may be able to do a ret without using a GPR like this:

add esp,4
jmp dword ptr [esp-4]
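A variant of the same idea, sketched in NASM syntax under the same caveat: the return address sits below ESP while the jump reads it, so anything that uses the stack asynchronously (a signal handler, for example) can overwrite it. It swaps add for lea, which leaves EFLAGS untouched the way ret does, whereas add esp,4 modifies the flags.

```nasm
    lea   esp, [esp+4]        ; ESP += 4 without modifying EFLAGS
    jmp   dword [esp-4]       ; read the return address from just below ESP
```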
mcleod_ideafix
  • 7
    Not safe; an interrupt can trash the return address. If you "tested" this it would look like it works, but.... – Ira Baxter Nov 21 '13 at 20:10
  • btw: dowrd->dword ;-) – Artur Nov 21 '13 at 21:07
  • 2
    @Ira Baxter: In user space (CPL = 3) this should work because interrupts use the CPL0 stack (separate ESP) - assuming that there is no "signal" in Linux that would also destroy the stack. – Martin Rosenau Nov 21 '13 at 21:27
  • 1
    @MartinRosenau: I thought any interrupt (OS, trap, whatever) pushed at least 3 words on the *current* stack before any kind of stack switch for tasks took place. Can you quote chapter and verse in some Intel document? – Ira Baxter Nov 21 '13 at 22:13
  • 1
    @IraBaxter: interrupts push their return info onto the *kernel* stack. Doing so on the user-space stack would let a multithreaded process take over the kernel by having another thread modify that stack memory which is mapped into the user-space process. This is safe in practice on current Windows, but explicitly documented as *unsafe* [Is it valid to write below ESP?](https://stackoverflow.com/q/52258402). It is explicitly safe on x86-64 System V, where the ABI guarantees a red-zone below RSP for user-space code. (The HW behaviour you describe makes a red zone impossible for kernel code.) – Peter Cordes Oct 08 '18 at 22:13
  • @PeterCordes: What you say makes sense on a thread running in a protected address space, agreed. I'm old school; I built lots of code on bare microprocessors where I was running essentially in "kernel mode" and this kind of surprise was almost always an issue. – Ira Baxter Oct 08 '18 at 23:13
  • @IraBaxter: Yeah, being insulated from HW interrupts is a big change. The user-space equivalent is Unix signal handlers, which can asynchronously clobber space below ESP (in a process that has any signal handlers registered). That's why the x86-64 SysV ABI needed to define a red-zone (which the kernel's mechanism for invoking signal handlers must respect) to make 128 bytes below RSP safe from async clobbers in user-space. – Peter Cordes Oct 08 '18 at 23:20
0

It is possible to make return_address an array of dwords and let each thread access return_address at a unique index computed by a one-to-one (injective) function of its unique identifier.

This change makes nrz's accepted answer work for multithreaded code as well!

  • 1
    How can you index an array without any data in registers? If you use a register, you might as well `pop ecx` / `jmp ecx`. I think the only way this actually works is Ira Baxter's suggestion to use thread-local storage like `pop gs:preallocated_tls_slot`. (Which also has the advantage of avoiding false sharing between threads all using different dwords in the same cache line at the same time.) – Peter Cordes Feb 21 '19 at 21:44
  • Every thread can theoretically call a system procedure or function that overwrites the contents of register eax with the identifier of the calling thread. Then simply `lea ebx,return_address`, and `dword ptr [ebx+eax]` is supposed to be the return address of the current thread. So simply do `jmp [ebx+eax]`. Also do `mov dword ptr [ebx+eax],ebp` beforehand, as **nrz** suggests in his/her accepted answer. – user11095906 Mar 03 '19 at 17:32
  • 1
    `ret` doesn't step on EAX or EBX, but your version does. I see zero advantage to this vs. `pop eax` / `jmp eax`. The only reason for considering memory at all is to indirect jmp *without* modifying any registers (other than ESP), to fully emulate `ret` even for a custom calling convention where all registers hold return values. (And obviously EAX is the worst possible choice, because it's the return-value register in all normal calling conventions. And EBX is call-preserved. But let's say you had your helper return in ECX, and you did `jmp [return_address+ecx*4]`. Still pointless.) – Peter Cordes Mar 03 '19 at 19:24
  • This question is pointless. No assembly programmer and no assembler and compiler will ever use and produce any code that is equivalent to a single `ret` instruction instead of using and producing a single `ret` instruction. Also OP didn't ask for efficiency and advantage at all, so I didn't take care of it at all. – user11095906 Mar 03 '19 at 23:01
  • 1
    Agreed that it's pointless, except maybe for understanding retpolines / Spectre v2 mitigation for indirect branches. Or for understanding that `ret` isn't magic, and is really just `pop eip`, so ESP has to be pointing to the right place. – Peter Cordes Mar 03 '19 at 23:06
  • OP could at least try making an effort to search through google and read about the `ret` instruction by himself/herself, understand and by himself/herself deduce an equivalent code to the `ret` instruction instead of asking this question and if he/she failed doing that, at least he/she had to write in his/her question what he/she didn't understand when reading some explanations and descriptions about this instruction on the web and the internet. – user11095906 Mar 03 '19 at 23:13