In the original version of this question, the OP mentioned AMD64. The usual calling conventions on x86-64 pass the first few args in registers (like some 32-bit conventions such as MS's vectorcall), not on the stack, so of course you use mov there.
mov edi, 10 ; x86-64 SysV calling convention.
mov esi, 20
call plus ; leaves eax = 30.
If you're writing functions that you only call from asm, you can make up custom calling conventions on a per-function basis, so you could use code like the above even for 32-bit. (But I'd recommend choosing registers that are not call-preserved in the normal ABI. For example, gcc -m32 using __attribute__((regparm(3))) on a function makes it take args in EAX, EDX, ECX, in that order.) See the x86 tag wiki for links to ABI docs.
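As a rough sketch of such a private convention in 32-bit code (NASM syntax): the label names plus_regs and caller are made up for illustration, and the choice of EAX and EDX for the two args simply mirrors the first two regparm registers mentioned above.

```nasm
; 32-bit x86, NASM syntax.
; Hypothetical custom convention: first arg in EAX, second in EDX, result in EAX.
; Both registers are call-clobbered in the normal 32-bit ABI.
plus_regs:
    add  eax, edx        ; eax = first arg + second arg
    ret                  ; no stack args, so there is nothing to clean up

caller:
    mov  eax, 10         ; first arg
    mov  edx, 20         ; second arg
    call plus_regs       ; returns with eax = 30
    ret
```

This only works if every caller is asm you control (or a compiler that has been told about the convention), since nothing else knows where plus_regs expects its args.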
Replacing push for putting args on the stack
If there's room above ESP, you can mov to that space instead of pushing args.
e.g. for two back-to-back calls, instead of add esp, 8 / push / push to clean the stack and push new args, you can just use mov to write new args. GCC does this with -maccumulate-outgoing-args, which some -mtune= settings enable. See this mailing list message from 2014 where Honza describes the pros and cons, and why he disabled it for the default -mtune=generic.
(Actually, gcc's -maccumulate-outgoing-args is slightly different. It reserves space for args ahead of the first call using a sub, so not even the first call uses push. This increases code size but minimizes changes to esp, so it shrinks the CFI metadata that maps instruction addresses to stack frame size, for stack-unwinding purposes with -fomit-frame-pointer. And on old CPUs where push is slower, it avoids using push.)
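A hand-written sketch of that reserve-once style, computing bar(foo(20, 10), 15) for two hypothetical stack-args (cdecl-style) functions foo and bar. This is not actual compiler output, and any stack-alignment padding a real compiler might add is left out.

```nasm
; 32-bit x86, NASM syntax. foo and bar are hypothetical functions
; that take two dword args on the stack and return in EAX.
    sub  esp, 8               ; reserve space for outgoing args once, up front
    mov  dword [esp], 20      ; first arg of foo
    mov  dword [esp+4], 10    ; second arg of foo
    call foo
    mov  dword [esp+4], 15    ; second arg of bar (rewritten: foo may have modified its arg slots)
    mov  [esp], eax           ; first arg of bar = foo's return value
    call bar
    add  esp, 8               ; release the arg space
```

ESP changes only twice here, at the sub and at the final add, which is what keeps the unwind metadata small.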
;; Delayed arg popping. push has been cheap since the Pentium-M stack engine, so we use it for the first call.
push 10
push 20
call foo
; add esp, 8            ; don't do this yet: reuse the arg slots with mov instead of push,
                         ; like gcc -maccumulate-outgoing-args
mov dword [esp+4], 15    ; second arg for bar
mov [esp], eax           ; first arg for bar = foo's return value
call bar
add esp, 8               ; NOW pop the args.
This does bar(foo(20, 10), 15), and leaves the stack pointing to the same place as before you started. Compared with popping the stack (with add) and then pushing new args, the mov version saves an instruction (the intermediate add esp, 8) but not code-size, and it may save 1 uop. Even so, popping and re-pushing may do better on modern CPUs.
On Intel's stack engine, an extra stack-sync uop is needed before the mov [esp+4] or before add esp, 8, so the cost in front-end uops is the same either way. The only way to avoid it would be to push new args, and do one big cleanup at the end (add esp, 16). But that risks more cache misses from touching new stack space. (See the x86 tag wiki for more optimization docs.)
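For illustration, the push-everything variant with a single cleanup might look like this for the same bar(foo(20, 10), 15) computation (foo and bar are again hypothetical stack-args functions):

```nasm
; 32-bit x86, NASM syntax.
    push 10              ; second arg of foo
    push 20              ; first arg of foo
    call foo
    push 15              ; second arg of bar
    push eax             ; first arg of bar = foo's return value
    call bar
    add  esp, 16         ; pop all four dword args with one add
```

ESP is only modified by push and call until the final add, but foo's args stay on the stack during the call to bar, so the sequence touches 8 more bytes of stack than the versions that reuse the arg slots.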
The mov version does minimize uop count for the uop-cache; the stack-sync uop is generated on the fly. But mov tends to be larger than push, especially for small immediate constants (2-byte push 1 vs. 8-byte mov dword [esp+4], 1), so I-cache footprint may be better with add esp, 8 / 2x push.
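To make the size difference concrete, here are the two ways of setting up bar's args from the example above, with 32-bit-mode machine-code sizes in the comments. The byte counts assume the assembler picks the short imm8 encodings where they exist, which NASM normally does.

```nasm
; Version A: pop the old args, then push new ones (6 bytes total)
    add  esp, 8               ; 3 bytes: 83 C4 08
    push 15                   ; 2 bytes: 6A 0F
    push eax                  ; 1 byte:  50

; Version B: overwrite the old arg slots in place (11 bytes total)
    mov  dword [esp+4], 15    ; 8 bytes: C7 44 24 04 0F 00 00 00
    mov  [esp], eax           ; 3 bytes: 89 04 24
```

Both versions still need the same add esp, 8 after the second call, so version B is one instruction shorter overall but 5 bytes larger.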