Usage of pusha/popa in function prologue/epilogue?

Question

Reading assembly tutorials, I saw that a "function" prologue/epilogue consists of:

    push bp
    mov bp, sp
    ; ... function body ...
    pop bp

But I also saw some other tutorials using pusha and then popa to preserve registers. So why doesn't the function prologue/epilogue perform a pusha/popa to save the register context, in addition to setting up bp?

There are different calling conventions (between caller and callee), about which registers can be freely used/overwritten, which have to be saved and restored and which registers are used for passing parameters. — Sebastian, Dec 30 '21 at 11:53
Yep, I saw the cdecl, stdcall, and fastcall conventions, but none seems to use pusha/popa? — 8HoLoN, Dec 30 '21 at 11:55
pusha and popa can be used to fulfill those calling conventions. But the only advantage seems to be code size: https://stackoverflow.com/questions/48449166/x86-assembly-pushad-popad-how-fast-it-is/48455374#48455374 — Sebastian, Dec 30 '21 at 12:05
The prologue/epilogue just has to fulfill the convention. E.g. the bp register does not necessarily have to be used for accessing the stack frame; you could use any other register, but bp has the advantage of defaulting to the SS segment when used in addressing. — Sebastian, Dec 30 '21 at 12:09
Why should the function waste time saving registers it doesn't need to save? — fuz, Dec 30 '21 at 13:09
If you are short on RAM or want to save memory for other reasons (better caching, or hacking/patching existing programs), the smaller code can be worth it; the assembly source is also shorter. But you could also write an assembler macro instead. If the instruction had been used more often, CPU manufacturers like Intel would have optimized their processor designs for it. — Sebastian, Dec 30 '21 at 13:29
@fuz Because this way you don't have to know the register usage of every function; you can simply be sure that no function call will mess up the registers of the current function? (I admit it's not the fastest at runtime, but it's the fastest to design.) I.e., if function a calls b, which calls c (a->b->c), and a uses ax but b doesn't, then b does not save ax; but if c uses ax, it should be saved even though b doesn't care. So does that mean the whole call chain has to be analysed at build time? — 8HoLoN, Dec 30 '21 at 13:54
@8HoLoN The compiler knows which registers each function it compiles uses. So it can save a lot of time by only saving those registers it plans to use. Additionally, some registers are caller saved and do not need to be saved at all. This is a lot better than always saving all registers. — fuz, Dec 30 '21 at 13:56
Also if you `popa` you obviously can't return a value in a simple way. — Jester, Dec 30 '21 at 13:59
Is the question only about pusha/popa or why no registers are saved except bp? Depending on the calling convention the other registers do not have to be preserved. Depending on the function implementation the other registers are not modified. The shown epilogue/prologue creates a standard stackframe to access parameters and local variables on the stack for addressing relative to bp. The saving and restoring of registers would have to be additionally/manually done. — Sebastian, Dec 30 '21 at 14:25
@Sebastian OK, thanks. So does that also mean that if a function does nothing with the stack (it has no arguments or local variables), bp is not saved? Or is the standard stack frame always used, even when not necessary? — 8HoLoN, Dec 30 '21 at 15:56
This prologue/epilogue is part of the function. It depends on what tool or programming language the function is created with (your whole program, including libraries, can be created with a mix of programming languages). Do you use an assembler for programming the function, or a high-level language? The CPU and hardware normally do not prescribe building a stack frame (especially for user code; interrupt routines are sometimes implemented differently). For high-level languages it is more standardized or done automatically by the optimizer; with assembler you have more freedom. — Sebastian, Dec 30 '21 at 16:10
Probably related: https://stackoverflow.com/questions/42208087/are-the-prologue-and-epilogue-mandatory-when-writing-assembly-functions Are the prologue and epilogue mandatory when writing assembly functions? — Sebastian, Dec 30 '21 at 16:15
@Sebastian I'm experimenting with writing a high-level language/library that produces the opcodes (basically writing a binary file which is a bootloader). — 8HoLoN, Dec 30 '21 at 16:34
So your own language - then you can create your own rules! Each function of yours can have its individual calling convention or none at all. — Sebastian, Dec 30 '21 at 16:40
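
To illustrate Jester's comment above about `popa` and return values, here is a minimal sketch. It assumes 16-bit code in NASM syntax, and the label `compute` is hypothetical. Because `popa` reloads AX along with everything else, a result left in AX would be overwritten unless it is first stored into the saved-AX slot on the stack.

```nasm
bits 16

; Hypothetical function that saves everything with pusha/popa.
; popa restores AX too, so the result must be written into the
; stack slot that popa will reload AX from.
compute:
        pusha               ; pushes AX, CX, DX, BX, SP, BP, SI, DI (16 bytes)
        mov  bp, sp         ; BP can be used for addressing in 16-bit mode; SP cannot
        mov  ax, 12345      ; intended return value
        mov  [bp+14], ax    ; saved AX was pushed first, so it sits at the highest offset
        popa                ; AX is reloaded from the patched slot, BP is restored as well
        ret
```

The extra store is the "not simple" part: it costs code size again and depends on knowing the exact order in which `pusha` lays out the registers.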

score 2 · Accepted Answer · answered Dec 30 '21 at 16:31

> Reading assembly tutorials, I saw that "function" prologue/epilogue...

For true assembly language (where you don't have to comply with the calling conventions of a different language) the words "function prologue/epilogue" don't make sense.

For "assembly language designed to comply with some other language's calling conventions"; you only need to save/restore some registers (possibly none).

For example, for CDECL the contents of EAX, ECX, and EDX can be trashed by the callee and never need to be saved/restored by the callee (the caller needs to save them if it cares); and if a function doesn't use any other registers, the callee doesn't need to save or restore any other registers either. Also note that "EBP as frame pointer" is antiquated rubbish (it existed because debuggers weren't very good, and became pointless when debugging info improved, e.g. DWARF debugging info). Combined, these things mean that something like this has an acceptable prologue and epilogue for CDECL:

    myFunction:
            mov eax,12345      ;eax = returned value
            ret
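
As a contrasting sketch (not part of the original answer): if a CDECL function does use a callee-saved register such as EBX, only that one register needs to be preserved. The function name `addTwice` and its single stack argument are assumptions for illustration, in NASM syntax.

```nasm
bits 32

; Hypothetical cdecl function: int addTwice(int x), returns x + x.
; EBX is callee-saved under cdecl, so it is the only register preserved here.
addTwice:
        push ebx                ; save the one callee-saved register we touch
        mov  ebx, [esp+8]       ; x: +0 saved EBX, +4 return address, +8 first argument
        lea  eax, [ebx+ebx]     ; eax = returned value
        pop  ebx                ; restore EBX
        ret                     ; cdecl: the caller removes the argument
```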

If "lots" of registers do need to be saved and restored, pusha is slow (micro-coded), and a series of push instructions can also be slow on CPUs without a stack engine (the address of each store depends on the value in ESP, which was just modified). One way is to do it yourself, like:

                        ;Don't bother saving EAX, ECX, EDX.
    sub esp,16          ;Create space to save 4 registers (but maybe more for local variables)

    mov [esp],ebx
    mov [esp+4],esi
    mov [esp+8],edi
    mov [esp+12],ebp
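
The snippet above shows only the save half. A sketch of a whole function with the matching restore might look like this, assuming the same four slots, no additional local variables, and a hypothetical label `myBigFunction` (NASM syntax):

```nasm
bits 32

; Hypothetical function that needs all four callee-saved registers.
myBigFunction:
        sub  esp, 16            ; reserve four dword slots
        mov  [esp], ebx
        mov  [esp+4], esi
        mov  [esp+8], edi
        mov  [esp+12], ebp

        ; ... body may now use EBX, ESI, EDI and EBP freely ...

        mov  ebx, [esp]         ; reload each register from the slot it was saved in
        mov  esi, [esp+4]
        mov  edi, [esp+8]
        mov  ebp, [esp+12]
        add  esp, 16            ; release the save area
        ret
```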

However, the cost is space. In a boot loader where code size is extremely limited (e.g. "part of 512 bytes"), a smart programmer will use true assembly language (where "function prologue/epilogue" doesn't make sense), and a beginner might use pusha to save space (without realizing that they have no reason to care about another programming language's calling conventions).

So you're saying it's better (from a CPU performance point of view) to manually make room on the stack by adjusting sp and then using mov, instead of using the push opcode? And not to rely on bp to reach the arguments or variables, but to use sp, and then restore sp with `add esp, 16`? — 8HoLoN, Dec 30 '21 at 17:09
What do you plan to use (16, 32, 64 bits)? Real mode / protected mode? — Sebastian, Dec 30 '21 at 19:52
@8HoLoN: In 16-bit mode like a bootloader, you often would want to use `bp` as a frame pointer, because `[sp+2]` isn't a valid addressing mode. (Although you could use `[esp+2]`, but that costs extra code size for an address-size prefix as well as a SIB byte, and code-size is often at a premium in a legacy BIOS bootloader.) Using pure `mov` instead of multiple `push` takes more code size, too, and isn't even faster on CPUs [since Pentium-M which have a "stack engine"](https://stackoverflow.com/q/36631576/224132) to bypass the serial dependency on the stack pointer for stack ops. — Peter Cordes, Dec 30 '21 at 20:00
@Sebastian First 16-bit mode (to simplify the problem), then progressively 32-bit and 64-bit mode (making the library dynamic so that it calculates everything depending on the mode used). But for the moment it is 16-bit mode. Actually, the goal of the library (written in JS/TS) is to provide a single API for writing at any level, from raw bytes, to an asm equivalent with a functional approach, to high-level concepts like in C/C++ and JS. I am not focusing on networking, only on code design for now: struct/object/function/event loop/promise/observable. That's a lot, but at least I will learn. — 8HoLoN, Dec 30 '21 at 20:16
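
As a sketch of the 16-bit point made in the comments above: `[sp+2]` cannot be encoded with 16-bit addressing, so `bp` is the usual way to reach a stack argument. The label `getArg` and the single word argument pushed by the caller are assumptions for illustration (NASM syntax, near call).

```nasm
bits 16

; Hypothetical near function: returns its one word argument in AX.
getArg:
        push bp
        mov  bp, sp
        mov  ax, [bp+4]     ; +0 saved BP, +2 return address, +4 first argument
        pop  bp
        ret                 ; the caller is assumed to remove the argument
```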

Olsonist · Answer 2 · 2021-12-30T18:15:36.650

They don't save all registers because you don't always need all registers saved. Saving them and restoring them is slow. Yeah, it's a small single instruction which seems like a saving but it takes time and stack space. To get an idea of what is saved look at the calling conventions.

https://en.wikipedia.org/wiki/X86_calling_conventions

PUSHA/PUSHAD—Push All General-Purpose Registers

These are slow instructions. On Skylake, PUSHA is 19 uops with a throughput of one per 8 cycles. POPA is 18 uops, also with a throughput of one per 8 cycles.

Also, PUSHA/PUSHAD are invalid in 64-bit mode. They were rightfully purged by AMD and then by Intel when x86 was extended to 64 bits.
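
Since PUSHA is unavailable there, 64-bit code saves registers individually. A minimal sketch, assuming the System V AMD64 calling convention (where RBX, RBP and R12 to R15 are callee-saved) and a hypothetical function name `useTwo`:

```nasm
bits 64

; Hypothetical System V AMD64 function that needs two callee-saved registers.
useTwo:
        push rbx
        push r12
        mov  rbx, rdi           ; first integer argument
        mov  r12, rsi           ; second integer argument
        lea  rax, [rbx+r12]     ; return value in RAX
        pop  r12                ; restore in reverse order
        pop  rbx
        ret
```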

Modern compilers go the other direction and avoid saving registers if possible. LLVM performs an optimization called shrink wrapping, where the prologue is moved later in the function so that early-exit paths that don't need it stay fast.

https://llvm.org/doxygen/ShrinkWrap_8cpp_source.html

These are terrible, horrible, no good, very bad instructions.
