
My code targets 386+ CPUs (DOSBox usually, occasionally a Pentium MMX), but I only use the 8086 feature set for compatibility. My code is written for a non-multitasking environment (MS-DOS or DOSBox).

In nested loops I often find myself reusing CX for the deeper loop counters. I PUSH it at the top of the nested loop, and POP it before LOOP executes.

Sometimes conditions other than CX reaching 0 terminate these inner loops. Then I am left with the unnecessary loop counter, and sometimes more variables, sitting on the stack that I need to clean up.

Is it faster to just add a constant to SP, or POP these unneeded values?

I know that the fastest way of doing it would be to store CX in a spare register at the top of the loop, then restore it before LOOP executes, forgoing the stack completely, but I often don't have a spare register.
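For reference, the register-only pattern I mean looks something like this (a sketch; it assumes DX happens to be free, which, as noted, often isn't the case):

```asm
    mov cx, 16d                  ;outer loop counter
OUTER_LOOP:
    mov dx, cx                   ;stash outer counter in a spare register
    mov cx, 11d                  ;inner loop counter
INNER_LOOP:
    ;...inner loop body...
    loop INNER_LOOP
    mov cx, dx                   ;restore outer counter, no stack traffic
    loop OUTER_LOOP
```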

Here's a piece of code where I add a constant to SP to avoid a couple of POP instructions:

FIND_ENTRY PROC

;SEARCHES A SINGLE SECTOR OF A DIRECTORY LOADED INTO secBuff FOR A 
;SPECIFIED FILE/SUB DIRECTORY ENTRY

;IF FOUND, RETURNS THE FILE/SUB DIRECTORY'S CLUSTER NUMBER IN BX
;IF NOT FOUND, RETURNS 0 IN BX

;ALTERS BX

;EXPECTS A FILE NAME STRING INDEX NUMBER IN BP
;EXPECTS A SECTOR OF A DIRECTORY (ROOT, OR SUB) TO BE LOADED INTO secBuff
;EXPECTS DS TO BE LOADED WITH varData


    push ax
    push cx
    push es
    push si
    push di




    lea si, fileName             ;si -> file name strings 
    mov ax, 11d                  ;ax -> file name length in bytes/characters
    mul bp                       ;ax -> offset to file name string
    add si, ax                   ;ds:si -> desired file name as source input
                                 ;for CMPS
    mov di, ds
    mov es, di
    lea di, secBuff              ;es:di -> first entry in ds:secBuff as 
                                 ;destination input for CMPS


    mov cx, 16d                  ;outer loop cntr -> num entries in a sector
ENTRY_SEARCH:                    
    push cx                      ;store outer loop cntr
    push si                      ;store start of the file name
    push di                      ;store start of the entry


    mov cx, 11d                  ;inner loop cntr -> length of file name
    repe cmpsb                   ;Do the strings match?
    jne NOT_ENTRY                ;If not, test next entry.

    pop di                       ;di -> start of the entry
    mov bx, WORD PTR [di+26]     ;bx -> entry's cluster number

    add sp, 4                    ;discard unneeded stack elements
    pop di
    pop si
    pop es
    pop cx
    pop ax
    ret

NOT_ENTRY:                       
    pop di                       ;di -> start of the entry
    add di, 32d                  ;di -> start of next entry
    pop si                       ;si -> start of file name
    pop cx                       ;restore the outer loop cntr
    loop ENTRY_SEARCH            ;loop till we've either found a match, or
                                 ;have tested every entry in the sector 
                                 ;without finding a match.

    xor bx, bx                   ;if we're here no match was found. 
                                 ;return 0.




    pop di
    pop si
    pop es
    pop cx
    pop ax
    ret


FIND_ENTRY ENDP
bad
    There's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! But if you care about performance, [`loop` is slow and should be avoided in the first place](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently), let alone push/pop of the loop counter! Use different regs for inner/outer loops. – Peter Cordes Jan 30 '18 at 00:22
    Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in [the x86 tag wiki](https://stackoverflow.com/tags/x86/info). 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary. (e.g. for `xor ebx,ebx`) – Peter Cordes Jan 30 '18 at 00:26
    a pop can/will cause a memory cycle at some level, anywhere from deep in the core to all the way out to slow dram. where mathmatical adjustment of the stack pointer to clean up the stack does not. So if that is what you are meaning to ask then obviously adjusting the stack pointer with a constant is faster, even if those cycles are cached at the lowest level. But then one has to ask why were they put on the stack in the first place. And you can avoid both solutions. – old_timer Jan 30 '18 at 00:30
    Something you might consider is some kind of calling convention (you can make up your own) that makes some registers volatile (caller saved) rather than all non-volatile (besides just BX) like this code. This could save you from saving and restoring all the registers. – Michael Petch Jan 30 '18 at 00:30
    @old_timer: you'd think `add sp,2` would be obviously faster, but if you're not bottlenecked on load port throughput, a modern out-of-order x86 CPU can have better throughput with `pop` because modifying `sp` *directly* requires a stack-sync uop. This is [why clang uses `push`/`pop` for stack alignment](https://stackoverflow.com/questions/37773787/why-does-this-function-push-rax-to-the-stack-as-the-first-operation) instead of `sub`/`add` in x86-64 code (when it's only a single push or pop). – Peter Cordes Jan 30 '18 at 00:42
    I understand all that Peter, clearly it has to do with scale, just a couple of pops, with an x86 and its baggage, and without indicating a specific core, specific motherboard implementation, dram, etc, not deterministic. A million pops, well that one is obvious, so somewhere in between there might be a transition point depending on the x86 and the environment around it as well as past code and data that has passed through it up to these instructions under test. – old_timer Jan 30 '18 at 00:45

1 Answer


If you want to write efficient code, pop vs. add is a very minor issue compared to reducing the amount of saving/restoring you need to do, and optimizing everything else (see below).


If it would take more than 1 pop, always use `add sp, imm`. Or `sub sp, -128` to save code size while still using an imm8. Some CPUs may also prefer `lea` instead of `add`/`sub` (e.g. gcc uses LEA whenever possible with `-mtune=atom`). Of course, that would require an address-size prefix in 16-bit mode because `[sp+2]` isn't a valid 16-bit addressing mode.
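For two leftover words, the two cleanups being compared look like this (a sketch; using AX as a dead scratch register is an assumption):

```asm
    add sp, 4                    ;discard two words with one ALU
                                 ;instruction, no memory access
    ;-- versus --
    pop ax                       ;discard two words with two pops;
    pop ax                       ;each pop performs a memory read
```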


Beyond that, there's no single answer that applies to both an actual 386 and a modern x86 like Haswell or Skylake! There have been a lot of microarchitectural changes between those CPUs. Modern CPUs decode x86 instructions to internal RISC-like uops. For a while, sticking to simple x86 instructions was important, but now modern CPUs can represent a lot of work in a single uop, so more complex x86 instructions (like push, or add with a memory source operand) are single-uop instructions.

Modern CPUs (since Pentium-M) have a stack engine that removes the need for a separate uop to actually update RSP/ESP/SP in the out-of-order core. Intel's implementation requires a stack-sync uop when you read/write RSP with a non-stack instruction (anything other than push/pop / call/ret), which is why pop can be useful, especially if you're doing it after a push or call.

clang uses push/pop to align the stack in x86-64 code when a single 8-byte offset is needed: see [Why does this function push RAX to the stack as the first operation?](https://stackoverflow.com/questions/37773787/why-does-this-function-push-rax-to-the-stack-as-the-first-operation)


But if you care about performance, [`loop` is slow and should be avoided in the first place](https://stackoverflow.com/questions/35742570/why-is-the-loop-instruction-slow-couldnt-intel-have-implemented-it-efficiently), let alone push/pop of the loop counter! Use different regs for inner/outer loops.
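A sketch of that structure, keeping the outer counter in DX (an assumed-free register) and using dec/jnz in place of `loop`:

```asm
    mov dx, 16d                  ;outer counter lives in DX, not CX
OUTER_ENTRY:
    mov cx, 11d                  ;inner counter in CX
INNER_CMP:
    ;...inner loop body...
    dec cx
    jnz INNER_CMP                ;dec/jnz instead of loop
    dec dx
    jnz OUTER_ENTRY              ;no push/pop of a counter anywhere
```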

Basically, you've gone pretty far down the wrong path as far as optimizing, so the real answer is just to point you at http://agner.org/optimize/, and other performance links in [the x86 tag wiki](https://stackoverflow.com/tags/x86/info). 16-bit code makes it hard to get good performance because of all the partial-register false dependencies on modern CPUs, but with some impact on code size you can break those by using 32-bit operand size when necessary (e.g. for `xor ebx,ebx`).
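For example (the operand-size prefix is the only code-size cost in 16-bit code):

```asm
    xor bx, bx                   ;16-bit zeroing: writes only BX, so a
                                 ;later read of EBX still depends on the
                                 ;old upper bits
    xor ebx, ebx                 ;32-bit zeroing (operand-size prefix in
                                 ;16-bit mode): zeroes all of EBX and
                                 ;breaks the false dependency on modern
                                 ;CPUs
```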


Of course, if you're optimizing for DOSBox, it's not a real CPU; it's an emulator. So `loop` may well be fast! IDK if anyone has profiled or written optimization guides for DOSBox's CPU emulation. But I'd recommend learning what's fast on real modern hardware; that's more interesting.

Peter Cordes
  • "I'd recommend learning what's fast on real modern hardware; that's more interesting." Maybe one day, but I have a hard enough time optimizing 8086 code. I haven't even figured out 386+ protected mode memory addressing, let alone the intricacies of modern assembly optimization. – bad Jan 30 '18 at 00:54
  • 1
    @Mylifeisabug.: modern OSes use a flat memory model where all segment bases are 0; learning user-space x86 under Linux, Windows, or OS X is easier IMO, and more relevant. (And you can profile your code on real hardware; see https://stackoverflow.com/questions/44169342/can-x86s-mov-really-be-free-why-cant-i-reproduce-this-at-all/44193770#44193770 for an example of profiling a tiny loop in a static executable under Linux). – Peter Cordes Jan 30 '18 at 02:14
  • 1
    But anyway, optimizing for real 8086 is mostly optimizing for code-size, because instruction-fetch (and mem in general) was the main bottleneck there. Learning to do that well has very little to do with modern x86 asm, except that when speed is equal, smaller is normally better for L1I cache and related indirect reasons. Code-size optimization can be fun, though: see https://codegolf.stackexchange.com/questions/77836/sort-an-integer-list/149038#149038 for a bubblesort in 19 bytes of machine code. – Peter Cordes Jan 30 '18 at 02:15