61

What is the difference between the enter and

push ebp
mov  ebp, esp
sub  esp, imm

instructions? Is there a performance difference? If so, which is faster and why do compilers always use the latter?

Similarly with the leave and

mov  esp, ebp
pop  ebp

instructions.

janw
小太郎

4 Answers

57

There is a performance difference, especially for enter. On modern processors it decodes to some 10 to 20 µops, while the three-instruction sequence is about 4 to 6, depending on the microarchitecture. For details, consult Agner Fog's instruction tables.

Additionally, the enter instruction usually has quite a high latency, for example 8 clocks on a Core 2, compared to the 3-clock dependency chain of the three-instruction sequence.

Furthermore, the compiler may spread the three-instruction sequence out for scheduling purposes, depending on the surrounding code of course, to allow more parallel execution of instructions.

Gunther Piez
  • 1
    May I ask where you get this information? And what about for `leave`? – 小太郎 May 12 '11 at 07:59
  • 10
    See http://www.agner.org/optimize/microarchitecture.pdf for a general overview of how the processor executes code and http://www.agner.org/optimize/instruction_tables.pdf for detailed instruction latencies. `leave` is equal in performance on some architectures, but AFAIK in no case faster. It consumes less memory in the instruction cache, though. – Gunther Piez May 12 '11 at 08:24
  • 3
    If the 3 instruction sequence is faster than `enter`, what is the point of it? – 小太郎 May 14 '11 at 00:31
  • 7
    Compatibility. It has been around since the 8086 and most likely always will be. The same goes for the `loop` instruction: it is way slower than `dec reg; jnz`, but it is still there because some old software might possibly use it. – Gunther Piez May 14 '11 at 09:04
  • 9
    Enter/leave were not in the 8086/8. I believe they were added in the 80186/8, as those (rarely used) chips had all the real-mode instructions of the iAPX 286 (which is well documented to have enter/leave). – Brian Knoblauch Jul 05 '13 at 13:24
  • 1
    There isn't a 3-cycle dependency chain for `push ebp` / `mov ebp,esp` / `sub esp, imm`. The `mov` in the middle has to read the modified ESP from `push` (costing a stack-sync uop on Intel), but it does *not* modify ESP itself. `sub esp, imm` can thus execute in parallel with the `mov`. The stack engine (since Pentium-M) also means `push ebp` doesn't have a separate latency cost for ESP, it's just another stack offset added to the one from the `call` that reached this code in the first place. – Peter Cordes Jul 05 '22 at 01:49
8

When designing the 80286, Intel's CPU designers decided to add two instructions to help maintain displays.

Here is the equivalent of what the microcode inside the CPU does:

; ENTER Locals, LexLevel

push    bp              ;Save dynamic link.
mov     tempreg, sp     ;Save for later.
cmp     LexLevel, 0     ;Done if this is lex level zero.
je      Lex0

lp:
dec     LexLevel
jz      Done            ;Quit if at last lex level.
sub     bp, 2           ;Index into display in prev act rec
push    [bp]            ; and push each element there.
jmp     lp              ;Repeat for each entry.

Done:
push    tempreg         ;Add entry for current lex level.

Lex0:
mov     bp, tempreg     ;Ptr to current act rec.
sub     sp, Locals      ;Allocate local storage
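To make the listing above easier to follow, here is a minimal Python sketch that replays it step by step on a simulated stack. It is only a model under stated assumptions: memory is a plain dict, the registers live in a dict, the starting addresses and the display entry 0x1234 are made up for illustration, and the helper names (`push`, `enter`, `prologue`, `fresh_state`) are hypothetical. It also ignores details of the real instruction such as fault checks and masking of the nesting level.

```python
WORD = 2  # 16-bit stack slots, matching the bp/sp registers in the listing


def push(mem, regs, value):
    """Decrement sp by one word and store value at the new top of stack."""
    regs["sp"] -= WORD
    mem[regs["sp"]] = value


def enter(mem, regs, locals_size, lex_level):
    """Replay the ENTER listing line by line on the simulated state."""
    push(mem, regs, regs["bp"])               # push bp      (save dynamic link)
    tempreg = regs["sp"]                      # mov tempreg, sp
    if lex_level != 0:                        # cmp LexLevel, 0 / je Lex0
        while True:
            lex_level -= 1                    # dec LexLevel
            if lex_level == 0:                # jz Done
                break
            regs["bp"] -= WORD                # sub bp, 2
            push(mem, regs, mem[regs["bp"]])  # push [bp]
        push(mem, regs, tempreg)              # Done: push tempreg
    regs["bp"] = tempreg                      # Lex0: mov bp, tempreg
    regs["sp"] -= locals_size                 # sub sp, Locals


def prologue(mem, regs, locals_size):
    """The plain push bp / mov bp, sp / sub sp, n sequence."""
    push(mem, regs, regs["bp"])
    regs["bp"] = regs["sp"]
    regs["sp"] -= locals_size


def fresh_state():
    # Hypothetical starting state: the caller's frame pointer is 0x200 and
    # its display holds one entry (0x1234) just below that frame pointer.
    return {0x1FE: 0x1234}, {"sp": 0x100, "bp": 0x200}


mem_a, regs_a = fresh_state()
enter(mem_a, regs_a, 8, 0)
mem_b, regs_b = fresh_state()
prologue(mem_b, regs_b, 8)
# With lex level 0, ENTER leaves the same state as the plain prologue.
print(regs_a == regs_b and mem_a == mem_b)

mem_c, regs_c = fresh_state()
enter(mem_c, regs_c, 8, 2)
# With lex level 2, one display entry is copied and the new frame is appended.
print({name: hex(value) for name, value in regs_c.items()})
```

In this model, a nesting level of 0 reduces ENTER to exactly the three-instruction prologue; everything else in the listing exists only to copy the display.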

Alternative to ENTER would be:

; enter n, 0 ;14 cycles on the 486

push    bp              ;1 cycle on the 486
mov     bp, sp          ;1 cycle on the 486
sub     sp, n           ;1 cycle on the 486

; enter n, 1 ;17 cycles on the 486

push    bp              ;1 cycle on the 486
push    [bp-2]          ;4 cycles on the 486
mov     bp, sp          ;1 cycle on the 486
add     bp, 2           ;1 cycle on the 486
sub     sp, n           ;1 cycle on the 486

; enter n, 3 ;23 cycles on the 486

push    bp              ;1 cycle on the 486
push    [bp-2]          ;4 cycles on the 486
push    [bp-4]          ;4 cycles on the 486
push    [bp-6]          ;4 cycles on the 486
mov     bp, sp          ;1 cycle on the 486
add     bp, 6           ;1 cycle on the 486
sub     sp, n           ;1 cycle on the 486

Etc. The long way might increase your file size, but it is much quicker.
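The per-instruction cycle counts quoted in the listings can be added up to compare against ENTER. The sketch below is plain arithmetic over the 486 figures given above; `manual_cycles` is a hypothetical helper, and it assumes the level-0 sequence includes the `mov bp, sp` that a complete frame setup needs (3 instructions rather than 2).

```python
# 486 cycle counts for ENTER, as quoted in the listings above.
ENTER_CYCLES = {0: 14, 1: 17, 3: 23}


def manual_cycles(lex_level):
    """Sum the quoted 486 cycle counts for the hand-written sequence."""
    cycles = 1                 # push bp
    cycles += 4 * lex_level    # one 4-cycle push [bp-k] per display entry
    cycles += 1                # mov bp, sp
    if lex_level > 0:
        cycles += 1            # add bp, 2*level (only needed with a display)
    cycles += 1                # sub sp, n
    return cycles


for level, enter_cost in sorted(ENTER_CYCLES.items()):
    manual = manual_cycles(level)
    print(f"level {level}: enter = {enter_cost} cycles, manual = {manual} cycles")
```

By these figures the hand-written sequences need fewer cycles than ENTER at every level listed, at the cost of more code bytes.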

One last note: programmers don't really use displays anymore, since they were a very slow workaround, which makes ENTER pretty useless now.

Source: https://courses.engr.illinois.edu/ece390/books/artofasm/CH12/CH12-3.html

Alexis Wilke
Pr0c3ss0r
    The "; enter n, 0 ;14 cycles on the 486" example is missing the `mov bp, sp` line. And `enter` and `leave` appeared on the 186, not 286. – ecm Jul 12 '20 at 12:14
  • 2
    Here is a [PDF with the 80186 instruction set](https://www.jamieiles.com/80186/development-guide.pdf) and we can find the ENTER and LEAVE instructions in it. Interestingly enough, I found [this 286 book](http://bitsavers.org/components/intel/80286/210498-001_iAPX_286_Programmers_Reference_1983.pdf) which says that the instructions were brand new in the 80286 processor. So I can understand the confusion. – Alexis Wilke Feb 27 '23 at 07:40
  • 1
    @AlexisWilke: Intel documentation for later CPUs typically pretends 186 didn't exist; it was only intended for embedded stuff, not mainstream PCs which ran the same applications as 8086 and 286 PCs. – Peter Cordes Feb 28 '23 at 06:13
6

There is no real speed advantage to using either of them, though the long method will probably run better because CPUs these days are more 'optimized' for the shorter, simpler instructions that are more generic in use (plus it allows saturation of the execution ports if you're lucky).

The advantage of LEAVE (which is still used; just look at the Windows DLLs) is that it's smaller than manually tearing down a stack frame, which helps a lot when space is limited.

The Intel instruction manuals (volume 2A to be precise) have more nitty-gritty details on the instructions, as do Agner Fog's optimization manuals.

Necrolis
  • I tested the `ENTER` vs `PUSH/MOV` on my Xeon and I get about 2x difference. So if you have a loop in the millions over a `CALL`, it would start to be slightly different. With 100 million iterations (what I tried), it's like 1 second. So if you have `CALL`s within the call, it multiplies the delay... I think that for a compiler, it's definitely a good optimization. – Alexis Wilke Feb 28 '23 at 14:50
6

enter is unusably slow on all CPUs, nobody uses it except maybe for code-size optimization at the expense of speed. (If a frame pointer is needed at all, or desired to allow more compact addressing modes for addressing stack space.)

leave is fast enough to be worth using, and GCC does use it (if ESP / RSP isn't already pointing at a saved EBP/RBP; otherwise it just uses pop ebp).
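The condition in parentheses can be illustrated with a small Python model of the stack. This is only a sketch under assumptions: memory is a dict (unwritten slots read as 0), the starting register values are arbitrary, and the helper names are hypothetical. It shows the mechanism, not GCC's actual decision logic. When the function allocated nothing below the saved EBP, ESP already points at it and `pop ebp` alone restores the caller's state; once locals were allocated, ESP has to be reset first, which is what `leave` does.

```python
WORD = 4  # 32-bit stack slots


def push(mem, regs, value):
    regs["esp"] -= WORD
    mem[regs["esp"]] = value


def pop(mem, regs):
    value = mem.get(regs["esp"], 0)  # unwritten stack slots read as 0 here
    regs["esp"] += WORD
    return value


def prologue(mem, regs, locals_size):
    push(mem, regs, regs["ebp"])     # push ebp
    regs["ebp"] = regs["esp"]        # mov  ebp, esp
    regs["esp"] -= locals_size       # sub  esp, imm


def leave(mem, regs):
    regs["esp"] = regs["ebp"]        # same effect as mov esp, ebp
    regs["ebp"] = pop(mem, regs)     # followed by pop ebp


def pop_ebp_only(mem, regs):
    regs["ebp"] = pop(mem, regs)     # only valid if esp points at the saved ebp


def run(locals_size, epilogue):
    """Set up a frame, tear it down with the given epilogue, return registers."""
    mem = {}
    regs = {"esp": 0x1000, "ebp": 0x2000}
    prologue(mem, regs, locals_size)
    epilogue(mem, regs)
    return regs


CALLER = {"esp": 0x1000, "ebp": 0x2000}
print(run(16, leave) == CALLER)         # leave restores the caller's registers
print(run(0, pop_ebp_only) == CALLER)   # no locals: pop ebp alone is enough
print(run(16, pop_ebp_only) == CALLER)  # locals present: pop ebp alone is not
```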

leave is only 3 uops on modern Intel CPUs (and 2 on some AMD). (https://agner.org/optimize/, https://uops.info/).

mov / pop is only 2 uops total (on modern x86 where a "stack engine" tracks updates to ESP/RSP). So leave is just one more uop than doing things separately. I've tested this on Skylake, comparing a call/ret in a loop with the function setting up a traditional frame pointer and tearing down its stack frame using mov/pop or leave. perf counters for uops_issued.any show one more front-end uop when you use leave than for mov/pop. (I ran my own test in case other measurement methods had been counting a stack-sync uop in their leave measurements, but using it in a real function controls for that.)

Possible reasons why older CPUs might have benefited more from keeping mov / pop split up:

  • In most CPUs without a uop cache (i.e. Intel before Sandybridge, AMD before Zen), multi-uop instructions can be a decode bottleneck. They can only decode in the first ("complex") decoder, so might mean the decode cycle before that produced fewer uops than normal.

  • Some Windows calling conventions are callee-pops for stack args, using ret n (e.g. ret 8 to do ESP/RSP += 8 after popping the return address). This is a multi-uop instruction, unlike plain near ret on modern x86. So the above reason goes double: leave and ret 12 couldn't decode in the same cycle.

  • Those reasons also apply to legacy decode to build uop-cache entries.

  • P5 Pentium also preferred a RISC-like subset of x86, being unable to even break up complex instructions into separate uops at all.

For modern CPUs, leave takes up 1 extra uop in the uop cache. And all 3 have to be in the same line of the uop cache, which could lead to only partial filling of the previous line. So larger x86 code size could actually improve packing into the uop cache. Or not, depending on how things line up.

Saving 2 bytes (or 3 in 64-bit mode) may or may not be worth 1 extra uop per function.

GCC favours leave; clang and MSVC favour mov/pop (even with clang -Oz, which optimizes for code size even at the expense of speed, e.g. by using push 1 / pop rax (3 bytes) instead of the 5-byte mov eax,1).

ICC favours mov/pop, but with -Os will use leave. https://godbolt.org/z/95EnP3G1f
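The byte counts mentioned above can be checked by writing out the machine-code encodings. The sketch below uses the standard encodings as Python byte strings; the dictionary and function names are hypothetical, and an assembler may pick the alternative `8B E5` form for `mov esp, ebp`, which has the same length.

```python
# Machine-code bytes for the instructions discussed above.
ENCODINGS_32 = {
    "leave": bytes([0xC9]),
    "mov esp, ebp": bytes([0x89, 0xEC]),
    "pop ebp": bytes([0x5D]),
}
ENCODINGS_64 = {
    "leave": bytes([0xC9]),
    "mov rsp, rbp": bytes([0x48, 0x89, 0xEC]),  # REX.W prefix adds one byte
    "pop rbp": bytes([0x5D]),
    "push 1": bytes([0x6A, 0x01]),
    "pop rax": bytes([0x58]),
    "mov eax, 1": bytes([0xB8, 0x01, 0x00, 0x00, 0x00]),
}


def size(table, *mnemonics):
    """Total encoded length in bytes of a sequence of instructions."""
    return sum(len(table[m]) for m in mnemonics)


saved_32 = size(ENCODINGS_32, "mov esp, ebp", "pop ebp") - size(ENCODINGS_32, "leave")
saved_64 = size(ENCODINGS_64, "mov rsp, rbp", "pop rbp") - size(ENCODINGS_64, "leave")
print("bytes saved by leave, 32-bit mode:", saved_32)
print("bytes saved by leave, 64-bit mode:", saved_64)
print("push 1 / pop rax:", size(ENCODINGS_64, "push 1", "pop rax"), "bytes")
print("mov eax, 1:", size(ENCODINGS_64, "mov eax, 1"), "bytes")
```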

Peter Cordes