mov
leaves the data on the stack, pop
removes it so you can only read it once, and only in order. Data below ESP has to be considered "lost", unless you're using a calling convention / ABI that includes a red-zone below the stack pointer.
Data is usually still there below ESP, but asynchronous stuff like signal handlers, or a debugger evaluating a call fflush(0)
in the context of your process, can step on it.
Also, pop
modifies ESP, so each pop
requires stack-unwind metadata1 in another section of the executable/library, for it to be fully ABI compliant with SEH on Windows or the i386 / x86-64 System V ABI on other OSes (which specifies that all functions need unwind metadata, even if they're not C++ functions that actually support propagating exceptions).
But if you're reading data for the last time, and you actually need it all, then yes pop is an efficient way to read it on modern CPUs (like Pentium-M and later, with a stack engine to handle the ESP updates without a separate uop.)
On older CPUs, like Pentium III, pop
was actually slower than 3x mov
+ add esp,12
and compilers did generate code the way Brendan's answer shows.
void foo() {
asm("" ::: "ebx", "esi", "edi");
}
This function forces the compiler to save/restore 3 call-preserved registers (by declaring clobbers on them.) It doesn't actually touch them; the inline asm string is empty. But this makes it easy to see what compilers will do for saving/restoring. (Which is the only time they'll use pop
normally.)
GCC's default (tune=generic) code-gen, or with -march=skylake
for example, is like this (from the Godbolt compiler explorer)
foo: # gcc8.3 -O3 -m32
push edi
push esi
push ebx
pop ebx
pop esi
pop edi
ret
But telling it to tune for an old CPU without a stack engine makes it do this:
foo: # gcc8.3 -march=pentium3 -O3 -m32
sub esp, 12
mov DWORD PTR [esp], ebx
mov DWORD PTR [esp+4], esi
mov DWORD PTR [esp+8], edi
mov ebx, DWORD PTR [esp]
mov esi, DWORD PTR [esp+4]
mov edi, DWORD PTR [esp+8]
add esp, 12
ret
gcc thinks -march=pentium-m
doesn't have a stack engine, or at least chooses not to use push/pop
there. I think that's a mistake, because Agner Fog's microarch pdf definitely describes the stack engine as being present in Pentium-M.
On P-M and later, push/pop are single-uop instructions, with the ESP update handled outside the out-of-order backend, and for push the store-address+store-data uops are micro-fused.
On Pentium 3, they're 2 or 3 uops each. (Again, see Agner Fog's instruction tables.)
On in-order P5 Pentium, push
and pop
are actually fine. (But memory-destination instructions like add [mem], reg
were generally avoided, because P5 didn't split them into uops to pipeline better.)
Mixing pop
with direct references to [esp]
will actually be potentially slower than just one or the other, on modern Intel CPUs, because it costs extra stack-sync uops.
Obviously writing EAX 3 times back to back means the first 2 loads are useless in both sequences.
See Extreme Fibonacci for an example of pop (1 uop, or like 1.1 uop with the stack sync uops amortized) being more efficient than lodsd (2 uops on Skylake) for reading through an array. (In evil code that assumes a large red-zone because it doesn't install signal handlers. Don't actually do this unless you know exactly what you're doing and when it will break; this is more of a silly computer tricks / extreme optimization for code-golf than anything that's practically useful.)
Footnote 1: The Godbolt compiler explorer normally filters out extra assembler directives, but if you uncheck that box you can see gcc's function that uses push/pop has .cfi_def_cfa_offset 12
after every push/pop.
pop ebx
.cfi_restore 3
.cfi_def_cfa_offset 12
pop esi
.cfi_restore 6
.cfi_def_cfa_offset 8
pop edi
.cfi_restore 7
.cfi_def_cfa_offset 4
The .cfi_restore 7
metadata directives have to be there regardless of push/pop vs. mov, because that lets stack unwinding restore call-preserved registers as it unwinds. (7
is the register number).
But for other uses of push/pop inside a function (like pushing args to a function call, or a dummy pop
to remove it from the stack), you wouldn't have .cfi_restore
, only metadata for the stack pointer changing relative to the stack frame.
Normally you don't worry about this in hand-written asm, but compilers have to get this right so there's a small extra cost to using push/pop
in terms of total executable size. But only in parts of the file that aren't mapped into memory normally, and not mixed with code.