Is there an advantage to reading data without using the pop operation?

Question

According to this PDF document (page 66), the following statements

mov eax, DWORD PTR SS:[esp]
mov eax, DWORD PTR SS:[esp + 4]
mov eax, DWORD PTR SS:[esp + 8]

are equivalent to the following bunch of statements:

pop eax
pop eax
pop eax

Is there any advantage of the former over the latter?

@PeterCordes, why are you writing comments? why not a complete answer with all these comments? — user366312, Apr 05 '19 at 12:19
Because I wasn't sure what the context of the question was. I wondered if you had some kind of use-case for `pop` vs. `mov` you were imagining, where you only needed to read data once but it didn't matter whether you adjusted ESP or not for some reason. But you seemed to be assuming that `pop` was *more* efficient, so you hadn't been reading old optimization manuals for early PPro / Pentium III CPUs that suggested avoiding `pop` in favour of `mov`. It's still an unclear question with no obvious general answer, just some things I could comment on. — Peter Cordes, Apr 05 '19 at 12:23

score 1 · Accepted Answer · answered Apr 06 '19 at 00:12

mov leaves the data on the stack, pop removes it so you can only read it once, and only in order. Data below ESP has to be considered "lost", unless you're using a calling convention / ABI that includes a red-zone below the stack pointer.

Data is usually still there below ESP, but asynchronous stuff like signal handlers, or a debugger evaluating a call fflush(0) in the context of your process, can step on it.
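The difference between the two ways of reading can be sketched with a small C model. This is only an illustration, not how the hardware stack is implemented: `ModelStack`, `model_peek`, `model_pop` and `live_slots` are hypothetical names, and the array stands in for the three dwords at [esp], [esp+4] and [esp+8].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_WORDS 3

/* Hypothetical model: `sp` indexes the current top of stack.
   Slots at indices lower than `sp` count as already popped ("lost"). */
typedef struct {
    uint32_t slot[STACK_WORDS];
    size_t sp;
} ModelStack;

/* Like `mov eax, [esp + 4*n]`: reads a slot and leaves sp unchanged,
   so any live slot can be read in any order, any number of times. */
uint32_t model_peek(const ModelStack *s, size_t n) {
    assert(s->sp + n < STACK_WORDS);
    return s->slot[s->sp + n];
}

/* Like `pop eax`: reads the top slot, then moves sp past it,
   so each slot can be read once, and only in order. */
uint32_t model_pop(ModelStack *s) {
    assert(s->sp < STACK_WORDS);
    return s->slot[s->sp++];
}

/* Number of slots still at or above sp, i.e. still safe to read. */
size_t live_slots(const ModelStack *s) {
    return STACK_WORDS - s->sp;
}
```

With a stack holding 10, 20, 30, `model_peek(&s, 2)` followed by `model_peek(&s, 0)` works and leaves all three slots live, while two calls to `model_pop` return 10 and 20 and leave only the last slot readable.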

Also, pop modifies ESP, so each pop requires stack-unwind metadata¹ in another section of the executable/library, for it to be fully ABI compliant with SEH on Windows or the i386 / x86-64 System V ABI on other OSes (which specifies that all functions need unwind metadata, even if they're not C++ functions that actually support propagating exceptions).

But if you're reading data for the last time, and you actually need it all, then yes, pop is an efficient way to read it on modern CPUs (like Pentium-M and later, with a stack engine to handle the ESP updates without a separate uop).

On older CPUs, like Pentium III, 3x pop was actually slower than 3x mov + add esp,12, and compilers did generate code the way Brendan's answer shows.

void foo() {
    asm("" ::: "ebx", "esi", "edi");
}

This function forces the compiler to save/restore 3 call-preserved registers (by declaring clobbers on them.) It doesn't actually touch them; the inline asm string is empty. But this makes it easy to see what compilers will do for saving/restoring. (Which is the only time they'll use pop normally.)

GCC's default (tune=generic) code-gen, or with -march=skylake for example, looks like this (from the Godbolt compiler explorer):

foo:                        # gcc8.3 -O3 -m32
        push    edi
        push    esi
        push    ebx
        pop     ebx
        pop     esi
        pop     edi
        ret

But telling it to tune for an old CPU without a stack engine makes it do this:

foo:                     # gcc8.3  -march=pentium3 -O3 -m32
        sub     esp, 12
        mov     DWORD PTR [esp], ebx
        mov     DWORD PTR [esp+4], esi
        mov     DWORD PTR [esp+8], edi
        mov     ebx, DWORD PTR [esp]
        mov     esi, DWORD PTR [esp+4]
        mov     edi, DWORD PTR [esp+8]
        add     esp, 12
        ret

gcc thinks -march=pentium-m doesn't have a stack engine, or at least chooses not to use push/pop there. I think that's a mistake, because Agner Fog's microarch pdf definitely describes the stack engine as being present in Pentium-M.

On P-M and later, push/pop are single-uop instructions, with the ESP update handled outside the out-of-order backend, and for push the store-address+store-data uops are micro-fused.

On Pentium 3, they're 2 or 3 uops each. (Again, see Agner Fog's instruction tables.)

On in-order P5 Pentium, push and pop are actually fine. (But memory-destination instructions like add [mem], reg were generally avoided, because P5 didn't split them into uops to pipeline better.)

Mixing pop with direct references to [esp] will actually be potentially slower than just one or the other, on modern Intel CPUs, because it costs extra stack-sync uops.
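A toy C model can show why mixing costs more. It assumes a simplified rule: the stack engine tracks an offset for push/pop/call/ret, and a sync uop is needed whenever an instruction uses ESP explicitly while such an offset is pending. The names `InsnKind` and `count_sync_uops` are hypothetical, and the model ignores details such as pushes and pops cancelling out or the offset overflowing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction classes for this toy model. */
typedef enum {
    STACK_OP,     /* push / pop / call / ret: ESP update tracked by the stack engine */
    EXPLICIT_ESP  /* e.g. mov eax, [esp+4] or add esp, 12: ESP used as a normal operand */
} InsnKind;

/* Count sync uops under the simplified rule stated above:
   one sync uop each time an explicit ESP use follows a stack op
   whose offset has not yet been folded back into ESP. */
size_t count_sync_uops(const InsnKind *seq, size_t n) {
    assert(seq != NULL || n == 0);
    size_t syncs = 0;
    int offset_pending = 0;  /* nonzero while the stack engine holds an offset */
    for (size_t i = 0; i < n; i++) {
        if (seq[i] == STACK_OP) {
            offset_pending = 1;
        } else if (offset_pending) {
            syncs++;             /* sync uop before the explicit ESP use */
            offset_pending = 0;
        }
    }
    return syncs;
}
```

In this model, three pops in a row or four explicit ESP uses in a row need no sync uops, while alternating pop and mov reg, [esp] three times needs three.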

Obviously writing EAX 3 times back to back means the first 2 loads are useless in both sequences.

See Extreme Fibonacci for an example of pop (1 uop, or like 1.1 uop with the stack sync uops amortized) being more efficient than lodsd (2 uops on Skylake) for reading through an array. (In evil code that assumes a large red-zone because it doesn't install signal handlers. Don't actually do this unless you know exactly what you're doing and when it will break; this is more of a silly computer tricks / extreme optimization for code-golf than anything that's practically useful.)

Footnote 1: The Godbolt compiler explorer normally filters out extra assembler directives, but if you uncheck that box you can see that gcc's function that uses push/pop has a .cfi_def_cfa_offset directive after every push/pop.

        pop     ebx
        .cfi_restore 3
        .cfi_def_cfa_offset 12
        pop     esi
        .cfi_restore 6
        .cfi_def_cfa_offset 8
        pop     edi
        .cfi_restore 7
        .cfi_def_cfa_offset 4
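The offsets in this listing follow from simple arithmetic, sketched here in C. It assumes 32-bit code where the CFA is ESP+4 at function entry (just above the return address) and every push or pop moves ESP by 4 bytes; `cfa_offset` is a hypothetical helper, not part of any toolchain.

```c
#include <assert.h>

enum { WORD = 4 };  /* bytes moved by each push/pop in 32-bit mode */

/* CFA offset from ESP after `pushes` pushes and `pops` pops since
   function entry. At entry only the return address lies between
   ESP and the CFA, hence the initial WORD. */
int cfa_offset(int pushes, int pops) {
    assert(pops >= 0 && pushes >= pops);
    return WORD + WORD * (pushes - pops);
}
```

After the three pushes the offset is `cfa_offset(3, 0)`, which is 16, and each pop then lowers it to 12, 8 and 4, matching the directives after pop ebx, pop esi and pop edi.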

The .cfi_restore metadata directives have to be there regardless of push/pop vs. mov, because they let stack unwinding restore call-preserved registers as it unwinds. (The operand is the DWARF register number: 3 is EBX, 6 is ESI, 7 is EDI.)

But for other uses of push/pop inside a function (like pushing args to a function call, or a dummy pop to remove it from the stack), you wouldn't have .cfi_restore, only metadata for the stack pointer changing relative to the stack frame.

Normally you don't worry about this in hand-written asm, but compilers have to get this right so there's a small extra cost to using push/pop in terms of total executable size. But only in parts of the file that aren't mapped into memory normally, and not mixed with code.

score 0 · Answer 2 · answered Apr 05 '19 at 13:08

This:

pop eax
pop ebx
pop ecx

.. is sort of equivalent to this:

mov eax,[esp]
add esp,4

mov ebx,[esp]
add esp,4

mov ecx,[esp]
add esp,4

..which can be like this:

mov eax,[esp]     ;Do this instruction
add esp,4         ; ...and this instruction in parallel

                   ;Stall until the previous instruction completes (and the value
mov ebx,[esp]      ;in ESP becomes known); then do this instruction
add esp,4          ; ...and this instruction in parallel

                   ;Stall until the previous instruction completes (and the value
mov ecx,[esp]      ;in ESP becomes known); then do this instruction
add esp,4          ; ...and this instruction in parallel

For this code:

mov eax, [esp]
mov ebx, [esp + 4]
mov ecx, [esp + 8]
add esp,12

.. all of the instructions can happen in parallel (in theory).

Note: In practice all of the above depends on which CPU, etc.

`add` will of course alter the flags, whereas `pop` doesn't. You could replace `add` with an equivalent `lea` that keeps the flags intact. — Michael Petch, Apr 05 '19 at 13:15
Modern x86 CPUs all have a stack engine that handles the ESP updates outside the out-of-order backend, handling the dependency chain between pop instructions with 0 cycle latency, so the loads *can* run in parallel. — Peter Cordes, Apr 05 '19 at 20:54
