
In x86 assembly, in the usual 32-bit calling conventions, it's normal to push function args and clean up the stack with an add afterwards. Are there any useful alternatives, e.g. using MOV instead of PUSH?

For example, is there another efficient way to do the following?

PUSH   10
PUSH   20
CALL   plus     ; int plus(int,int) adds two integer args, leaving EAX = 30
add    esp, 8
Peter Cordes
signal
  • Your question is ambiguous. Replacing `push` with `mov` here can mean several things. (1) Change the calling convention and pass the arguments via registers. (2) Rewrite the `push` statement in general using `mov` and keep the calling convention. So which is it? – cadaniluk Jul 24 '16 at 11:35
  • Mov dword ptr [esp], 10 ? – ABuckau Jul 24 '16 at 11:45
  • You can't simply replace a `push` with a `mov` since the `mov` will never change the `esp`. Seeking to replace a `push` with a `mov` is more involved than your question makes it sound. Also, your code fragment isn't great with that mysterious "EAX will be 30" sitting on the end. – David Hoelzer Jul 24 '16 at 12:14
  • Okay... I was confused and thought there were two ways to pass parameters, using `MOV` and using `PUSH`, but that's not true. In fact, `MOV` is not a direct equivalent of `PUSH`. Thanks everyone. – signal Jul 24 '16 at 12:18
  • You've reworded your question but it's still not the whole story. There are multiple calling conventions, some of which use the stack exclusively, most of which use registers and the stack. – David Hoelzer Jul 25 '16 at 02:41

3 Answers

3
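sub rsp, 8                  ; reserve space for two 4-byte args
mov dword ptr [rsp+4], 10   ; the arg the question pushes first
mov dword ptr [rsp], 20     ; the arg the question pushes second
call plus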

This does the same thing as your code up to the call, but without using push (in 32-bit code, use esp instead of rsp, and still do the add esp, 8 after the call). The "translation" is straightforward, given that a push is defined as "decrement the stack pointer by the operand size, then move the operand to the location pointed to by the stack pointer".
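For the 32-bit case in the question, a minimal sketch of the same translation might look like the following. It is written in NASM syntax (an assumption; MASM would spell the stores with `dword ptr`), and `plus` is the cdecl function from the question.

```nasm
; 32-bit sketch: same stack layout as the two pushes in the question.
sub   esp, 8                ; reserve two 4-byte arg slots
mov   dword [esp+4], 10     ; slot that `push 10` would have filled
mov   dword [esp], 20       ; slot that `push 20` would have filled
call  plus                  ; returns with EAX = 30
add   esp, 8                ; caller cleans up, as in the question
```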

Matteo Italia
  • You should absolutely invert the order of operations: first `sub rsp, 8`, then the moves. Memory below the stack pointer belongs to "others". – Margaret Bloom Jul 24 '16 at 13:17
  • @MargaretBloom: Not in the SysV AMD64 ABI (with a red zone), but yes, since you're going to subtract anyway, it's better to do it before the `mov`s. (Better mostly in the sense of avoiding confusion for humans, and not needing this explanation about how it's only safe with a red-zone). For the CPU, Matteo's sequence is potentially better because the first `mov` can execute in the same cycle as the `sub`, instead of the cycle after. – Peter Cordes Jul 24 '16 at 13:33
  • Fixing it, you are right that it's definitely less confusing when reading it (`sub rsp, something` usually reads as *"allocate" `something` from the stack*, and using before allocating does feel wrong). – Matteo Italia Jul 24 '16 at 13:40
  • @PeterCordes red zone, eh? Spoiled kids nowadays, can't even relive that 2 weeks of "why sometimes I get corrupted data there?" insanity, when somebody forgot to consider interrupts running.... :D (and no, there was no Stackoverflow to ask somebody else) :D ... red zone.. oh my.. This is sad. ;) (joking) – Ped7g Jul 24 '16 at 21:28
  • @Ped7g well, that is what the ABI document calls it... :) :) Just try to think back and remember life before google. Programming in the 70s was a different world. :) – David Hoelzer Jul 25 '16 at 11:26
3

In the original version of this question, the OP mentioned AMD64. The usual calling conventions on x86-64 pass the first few args in registers (like some 32-bit conventions such as MS's vectorcall), not on the stack, so of course you use mov there.

mov   edi, 10     ; x86-64 SysV calling convention.
mov   esi, 20
call  plus        ; leaves eax = 30.

If you're writing functions that you only call from asm, you can make up custom calling conventions on a per-function basis, so you could use code like the above even for 32-bit. (But I'd recommend choosing registers that are not call-preserved in the normal ABI. For example, with gcc -m32, using __attribute__((regparm(3))) on a function makes it take args in EAX, EDX, ECX, in that order.) See the tag wiki for links to ABI docs.
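As a rough illustration, a custom register-arg convention in 32-bit code could look like this sketch (NASM syntax). The name `plus_regs` and the choice of EAX/EDX are assumptions made up for this example; they follow the regparm order mentioned above but are not part of any standard ABI.

```nasm
; Hypothetical asm-only convention: first arg in EAX, second in EDX,
; result in EAX. Both registers are call-clobbered in the usual 32-bit ABIs.
plus_regs:                  ; made-up name for this sketch
    add   eax, edx          ; EAX = first + second
    ret                     ; no stack args, so nothing for the caller to pop

caller:
    mov   eax, 10
    mov   edx, 20
    call  plus_regs         ; EAX = 30 on return
    ret
```

Because no args go on the stack, the caller needs no add esp afterwards.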


Replacing push for putting args on the stack

If there's room above ESP, you can mov to that space instead of pushing args.

For example, for two back-to-back calls, instead of add esp, 8 / push / push to clean the stack and push new args, you can just use mov to overwrite the old args with the new ones. GCC does this with -maccumulate-outgoing-args, which some -mtune= settings enable. See this 2014 mailing list message where Honza describes the pros and cons, and why he disabled it for the default -mtune=generic.

(Actually, gcc's -maccumulate-outgoing-args is slightly different. It reserves space for args ahead of the first call using a sub, so not even the first call uses push. This increases code size but minimizes changes to esp, so it shrinks the CFI metadata that maps instruction addresses to stack frame size, for stack-unwinding purposes with -fomit-frame-pointer. It also avoids push entirely, which helps on old CPUs where push is slower.)
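A hand-written sketch of that reserve-up-front style might look like this (NASM syntax). It is not actual gcc output, and `foo` and `bar` are placeholder cdecl functions that each take two int args.

```nasm
; Reserve outgoing-arg space once, then store args with mov for every call.
sub   esp, 8                ; room for the largest outgoing arg list (2 args)
mov   dword [esp+4], 10     ; second arg of foo
mov   dword [esp], 20       ; first arg of foo
call  foo                   ; foo(20, 10)
mov   dword [esp+4], 15     ; second arg of bar
mov   [esp], eax            ; first arg of bar = foo's return value
call  bar                   ; bar(foo(20, 10), 15)
add   esp, 8                ; release the arg space once
```

Here esp changes only twice, regardless of how many calls sit in between.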

;; Delayed arg popping.  Push is cheap since Pentium-M stack engine so we use it for the first call
push     10
push     20
call     foo
; add esp,8   ; don't do this yet, using mov instead of push
; then like gcc -maccumulate-outgoing-args
mov      dword [esp+4], 15
mov      [esp], eax
call     bar
add      esp,8                ; NOW pop the args.

This does bar(foo(20,10), 15), and leaves the stack pointing to the same place as before you started. Compared with add esp,8 / push / push, the mov version saves instructions but not code size, and it may save 1 uop. Even so, popping the stack (with add) and then pushing new args may do better on modern CPUs.

On Intel's stack engine, an extra stack-sync uop is needed before the mov [esp+4] or before add esp, 8, so the cost in front-end uops is the same either way. The only way to avoid it would be to push new args, and do one big cleanup at the end (add esp, 16). But that risks more cache misses from touching new stack space. (See the tag wiki for more optimization docs).

Using mov does minimize uop count for the uop-cache, since the stack-sync uop is generated on the fly. But mov tends to be larger than push, especially for small immediate constants (2-byte push 1 vs. 8-byte mov dword [esp+4], 1), so I-cache footprint may be better with add esp,8 / 2x push.
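The "push new args and do one big cleanup at the end" alternative mentioned above could be sketched like this (NASM syntax; `foo` and `bar` are the same placeholder cdecl functions as in the earlier example):

```nasm
; Push args for both calls and pop everything with a single add.
push  10
push  20
call  foo                   ; foo(20, 10)
push  15                    ; second arg of bar
push  eax                   ; first arg of bar = foo's return value
call  bar                   ; bar(foo(20, 10), 15)
add   esp, 16               ; pop all four 4-byte args at once
```

The trade-off is that the second call's args land 8 bytes lower on the stack than they would if the first call's slots were reused.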

Peter Cordes
  • Avoiding PUSH was a big deal back in the day, not only because of the poor stack engine, but also because PUSH didn't pair. It's still slower to PUSH at least through the P4. Your answer leads me to believe that's changed on modern processors, but I wasn't aware of that. Microsoft's compiler (at least when targeting 32-bit) still avoids PUSH when it can MOV. I think it matters too what variant of PUSH you're using. PUSH with a register operand is almost always a good idea, but the considerations change for the other encodings. Intel's manuals still recommend MOV to reg before PUSH. – Cody Gray - on strike Jul 24 '16 at 18:03
  • @CodyGray: `push` is a single uop with an immediate or register operand, on Pentium-M and later. `push reg` is very good for code density, since it's only one byte. (or two (REX prefix) for r8-r15). Mixing `push` and `mov [esp]` can take extra stack-sync uops. Fun fact: push is so cheap that clang uses `push rax` to align the stack before a `call` instead of `sub rsp, 8`. – Peter Cordes Jul 24 '16 at 23:29
-5

Like this: mov eax, [value]. Good luck with assembly programming; it is hard.