64

Those familiar with x86 assembly programming are well acquainted with the typical function prologue / epilogue:

push ebp ; Save old frame pointer.
mov  ebp, esp ; Point frame pointer to top-of-stack.
sub  esp, [size of local variables]
...
mov  esp, ebp ; Restore stack pointer, releasing the space used for locals.
pop  ebp ; Restore old frame pointer.
ret

This same sequence of code can also be implemented with the ENTER and LEAVE instructions:

enter [size of local variables], 0
...
leave
ret

The ENTER instruction's second operand is the nesting level, which allows multiple parent frames to be accessed from the called function.
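
To make the nesting-level operand concrete, here is a minimal Python sketch of what `enter size, level` does in 32-bit mode, following the pseudocode in Intel's instruction reference. The `stack` dict, the `regs` dict, and the function name `enter` are hypothetical stand-ins made up for this illustration, not a real emulator API.

```python
def enter(stack, regs, alloc_size, nesting_level):
    """Simulate 32-bit `enter alloc_size, nesting_level` on a toy machine.

    stack: dict mapping addresses to 32-bit values (hypothetical model).
    regs:  dict holding 'esp' and 'ebp' (hypothetical model).
    """
    level = nesting_level % 32           # only the low 5 bits are used

    def push(value):
        regs['esp'] -= 4
        stack[regs['esp']] = value

    push(regs['ebp'])                    # save the caller's frame pointer
    frame_temp = regs['esp']             # this will become the new EBP

    if level > 0:
        # Copy level-1 frame pointers out of the caller's frame ...
        for _ in range(level - 1):
            regs['ebp'] -= 4
            push(stack[regs['ebp']])
        # ... and append a pointer to the frame being created.
        push(frame_temp)

    regs['ebp'] = frame_temp
    regs['esp'] -= alloc_size            # reserve space for locals


# Outer procedure at lexical level 1, then a procedure nested inside it.
# (A real CALL would also push a return address first; omitted here.)
regs = {'esp': 0x1000, 'ebp': 0x2000}
stack = {}
enter(stack, regs, 8, 1)
outer_frame = regs['ebp']
enter(stack, regs, 0, 2)
# The inner frame's slot at [ebp-4] now holds the outer frame pointer.
assert stack[regs['ebp'] - 4] == outer_frame
```

With a level of 0 this reduces to the familiar `push ebp` / `mov ebp, esp` / `sub esp, N`. With a level of N > 0, the new frame additionally holds N-1 frame pointers copied from the caller plus its own, and that array (the "display") is what lets a nested function reach the locals of its lexical parents.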

This is not used in C because standard C has no nested functions; local variables have only the scope of the function they're declared in. This construct does not exist (although sometimes I wish it did):

void func_a(void)
{
    int a1 = 7;

    void func_b(void)
    {
        printf("a1 = %d\n", a1);  /* a1 inherited from func_a() */
    }

    func_b();
}

Python, however, does have nested functions which behave this way:

def func_a():
    a1 = 7
    def func_b():
        print('a1 = %d' % a1)     # a1 inherited from func_a()
    func_b()

Of course Python code isn't translated directly to x86 machine code, and thus would be unable (unlikely?) to take advantage of this instruction.

Are there any languages which compile to x86 and provide nested functions? Are there compilers which will emit an ENTER instruction with a nonzero second operand?

Intel invested a nonzero amount of time/money into that nesting level operand, and basically I'm just curious if anyone uses it :-)

Jonathon Reinhart
  • +1, the most interesting question of today. For 1), GCC supports [nested functions in C](https://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html) using exactly your syntax. But explicitly not in C++. – Iwillnotexist Idonotexist Oct 12 '14 at 08:26
  • @IwillnotexistIdonotexist I coincidentally just ran across that same page. Interestingly it compiles on gcc 4.7.2 with the default options. Looking forward to looking at the disassembly. Fun stuff! – Jonathon Reinhart Oct 12 '14 at 08:29
  • Even if it made sense to use it, that instruction is not particularly efficient. – harold Oct 12 '14 at 08:43
  • @harold Understood, but that is as it is normally used. I'd imagine that implementing `enter 200h, 31` via `mov`/`push` would be less efficient. – Jonathon Reinhart Oct 12 '14 at 08:51
  • Maybe so. I have the time for `ENTER a,b` listed here as `79 + 5b` on Nehalem (the number of µops scales even worse in `b`) (similar numbers apply to most architectures), it's sort of hard to do worse than that. – harold Oct 12 '14 at 08:55
  • @harold Wow, that is bad. I stand corrected! – Jonathon Reinhart Oct 12 '14 at 08:56
  • For what it is worth, I understand from `grep`-ing `gcc-4.8.2/gcc/config/i386/i386.c:10339` that GCC simply never emits `ENTER` at all nowadays. And the comment at that line is quite clear: `/* Note: AT&T enter does NOT have reversed args. Enter is probably slower on all targets. Also sdb doesn't like it. */` – Iwillnotexist Idonotexist Oct 12 '14 at 09:04
  • @IwillnotexistIdonotexist Very good to know. I think the instruction has been largely damned to obsolescence. Amazingly though, it is still valid in 64-bit mode, unlike many other obsolete instructions which AMD took the opportunity to can. – Jonathon Reinhart Oct 12 '14 at 09:10
  • @IwillnotexistIdonotexist FWIW, that was part of the very first version of GCC. `git log -p` on their cvs->svn->git converted repository shows that it already existed in the initial check-in in 1992. –  Oct 12 '14 at 09:15
  • And my private svn checkout of LLVM 3.5 has at `llvm/lib/Target/X86/X86FrameLowering.cpp:355` a comment for the `emitPrologue()` method which reads in part `; Spill general-purpose registers [for all callee-saved GPRs] pushq % [if not needs FP] .cfi_def_cfa_offset (offset from RETADDR) .seh_pushreg %`. There are no mentions of `ENTER`, only pushes; And the enum constant for x86 `ENTER` occurs only 3 times in all of LLVM; It doesn't even look as though they have testcases for it. – Iwillnotexist Idonotexist Oct 12 '14 at 09:38
  • So neither GCC nor LLVM produce `ENTER` ever, not even for `-Os`. If somebody can dig in the sources of ICC and MSVC (Har, har, fat chance of that happening) and confirm it never generates `ENTER`, you'll know that the answer to 2) is approximately _*no*_. – Iwillnotexist Idonotexist Oct 12 '14 at 09:50
  • FWIW, Pascal has nested functions, but none of the x86 Pascal compilers I know uses `ENTER`. – Rudy Velthuis Oct 12 '14 at 16:22

9 Answers

57

enter is avoided in practice as it performs quite poorly - see the answers at "enter" vs "push ebp; mov ebp, esp; sub esp, imm" and "leave" vs "mov esp, ebp; pop ebp". There are a bunch of x86 instructions that are obsolete but are still supported for backwards compatibility reasons - enter is one of those. (leave is OK though, and compilers are happy to emit it.)

Implementing nested functions in full generality as in Python is actually a considerably more interesting problem than simply selecting a few frame management instructions - search for 'closure conversion' and 'upwards/downwards funarg problem' and you'll find many interesting discussions.

Note that the x86 was originally designed as a Pascal machine, which is why there are instructions to support nested functions (enter, leave), the pascal calling convention in which the callee pops a known number of arguments from the stack (ret K), bounds checking (bound), and so on. Many of these operations are now obsolete.

gsg
  • +1 *"Note that the x86 was originally designed as a Pascal machine"* - I often wondered which high-level languages the designers had in mind when they added high-level language support instructions. Any additional historical perspective you could link to? – Jonathon Reinhart Oct 12 '14 at 10:40
  • Have a look at http://stevemorse.org/8086/ - Morse is the designer of the chip, and the chapters about Pascal and PL/M might be illuminating. – gsg Oct 12 '14 at 11:03
  • @JonathonReinhart once upon a time [structured programming](http://en.wikipedia.org/wiki/Structured_programming) was the silver bullet and [Pascal](http://en.wikipedia.org/wiki/Pascal_(programming_language)) influenced languages like [Modula](http://en.wikipedia.org/wiki/Modula-2) and especially [Ada](http://en.wikipedia.org/wiki/Ada_(programming_language)) which was "the language" of the United States Department of Defense. So hardware support of those languages does not surprise me – xmojmr Oct 12 '14 at 12:37
  • @gsg Thank you for that link! Lots of interesting insight into the design decisions that were made in the early days. – Jonathon Reinhart Oct 12 '14 at 18:44
  • GCC has an `-Os` option. When optimizing for size, one typically doesn't care if the corresponding instruction is inefficient. – Aki Suihkonen Oct 13 '14 at 06:07
  • @AkiSuihkonen: `-Os` is not really "optimize for size" but rather "optimize for performance without performing any optimizations which are likely to adversely affect size". – R.. GitHub STOP HELPING ICE Oct 14 '15 at 20:18
  • Question about "Pascal Machine" and x86 https://retrocomputing.stackexchange.com/q/6959/8579 – Evan Carroll Jul 06 '18 at 08:43
  • GCC12 now has a `-Oz`, like clang's `-Oz`, which does aggressively optimize for machine-code size, even at a large cost in speed. (e.g. `push 1`/`pop rax` instead of `mov eax,1`). But neither GCC nor clang's `-Oz` currently has a peephole optimization to use `enter` at all, let alone with a non-zero nesting level. https://godbolt.org/z/efa4feTEh shows gcc/clang `-Oz -fno-omit-frame-pointer -mno-red-zone`, with clang ironically using `sub rsp,4` / `add rsp,4` instead of a dummy push/pop like it normally does. (Maybe it's also minimizing stack-space usage, even in leaf functions?) – Peter Cordes Sep 28 '22 at 21:52
14

As Iwillnotexist Idonotexist pointed out, GCC does support nested functions in C, using the exact syntax I've shown above.

However, it does not use the ENTER instruction. Instead, variables which are used in nested functions are grouped together in the local variables area, and a pointer to this group is passed to the nested function. Interestingly, this "pointer to parent variables" is passed via a nonstandard mechanism: on x64 it is passed in r10, and on x86 (cdecl) it is passed in ecx, the register that the thiscall convention uses for the this pointer in C++ (which doesn't support nested functions anyway).

#include <stdio.h>
void func_a(void)
{
    int a1 = 0x1001;
    int a2=2, a3=3, a4=4;
    int a5 = 0x1005;

    void func_b(int p1, int p2)
    {
        /* Use variables from func_a() */
        printf("a1=%d a5=%d\n", a1, a5);
    }
    func_b(1, 2);
}

int main(void)
{
    func_a();
    return 0;
}

Produces the following (snippet of) code when compiled for 64-bit:

00000000004004dc <func_b.2172>:
  4004dc:   push   rbp
  4004dd:   mov    rbp,rsp
  4004e0:   sub    rsp,0x10
  4004e4:   mov    DWORD PTR [rbp-0x4],edi
  4004e7:   mov    DWORD PTR [rbp-0x8],esi
  4004ea:   mov    rax,r10                    ; ptr to calling function "shared" vars
  4004ed:   mov    ecx,DWORD PTR [rax+0x4]
  4004f0:   mov    eax,DWORD PTR [rax]
  4004f2:   mov    edx,eax
  4004f4:   mov    esi,ecx
  4004f6:   mov    edi,0x400610
  4004fb:   mov    eax,0x0
  400500:   call   4003b0 <printf@plt>
  400505:   leave  
  400506:   ret    

0000000000400507 <func_a>:
  400507:   push   rbp
  400508:   mov    rbp,rsp
  40050b:   sub    rsp,0x20
  40050f:   mov    DWORD PTR [rbp-0x1c],0x1001
  400516:   mov    DWORD PTR [rbp-0x4],0x2
  40051d:   mov    DWORD PTR [rbp-0x8],0x3
  400524:   mov    DWORD PTR [rbp-0xc],0x4
  40052b:   mov    DWORD PTR [rbp-0x20],0x1005
  400532:   lea    rax,[rbp-0x20]              ; Pass a1, a5 to the nested function
  400536:   mov    r10,rax                     ; in r10 !
  400539:   mov    esi,0x2
  40053e:   mov    edi,0x1
  400543:   call   4004dc <func_b.2172>
  400548:   leave  
  400549:   ret  

Output from objdump --no-show-raw-insn -d -Mintel

This would be equivalent to something more verbose like this:

#include <stdio.h>

struct func_a_ctx
{
    int a1, a5;
};

void func_b(struct func_a_ctx *ctx, int p1, int p2)
{
    /* Use variables from func_a() */
    printf("a1=%d a5=%d\n", ctx->a1, ctx->a5);
}

void func_a(void)
{
    int a2=2, a3=3, a4=4;
    struct func_a_ctx ctx = {
        .a1 = 0x1001,
        .a5 = 0x1005,
    };

    func_b(&ctx, 1, 2);
}

Jonathon Reinhart
  • Interesting to see what `gcc -O0` does. It's probably rare for gcc not to inline a nested function with optimization enabled. Although maybe if there are many call-sites in the outer function... (especially if you optimize for size with `-Os`.) – Peter Cordes Sep 28 '17 at 12:58
  • @Peter The other case would be where the inner function is passed as a callback to some external function. It is then that the closure-stub on the stack is really necessary, as a single function pointer cannot otherwise encapsulate both the function address and its data. – Jonathon Reinhart Sep 28 '17 at 19:33
  • Oh right, I think I've seen gcc emit mov-immediate stores of x86 machine code for the stub you're talking about. And it emits assembler directives to mark the stack executable so this can work, so linking an object file that uses that will make your whole program's stack executable! http://lists.llvm.org/pipermail/cfe-dev/2015-September/045063.html (no clang support yet) – Peter Cordes Sep 28 '17 at 19:48
  • Here's an example of gcc writing machine-code bytes to the stack before passing a function-pointer to a nested function (to a function it can't see): https://godbolt.org/g/NaSZWp. – Peter Cordes Sep 28 '17 at 19:57
  • Yep, that's exactly what I was referring to. Pretty cool stuff, really! – Jonathon Reinhart Sep 29 '17 at 01:05
11

Our PARLANSE compiler (for fine-grain parallel programs on SMP x86) has lexical scoping.

PARLANSE tries to generate many, many small parallel grains of computation, and then multiplexes them on top of threads (1 per CPU). In fact, the stack frames are heap allocated; we didn't want to pay the price of a "big stack" for each grain since we have many, and we didn't want to put a limit on how deep anything could recurse. Because of parallel forks, the stack is actually a cactus stack.

Each procedure, on entry, builds a lexical display to enable access to surrounding lexical scopes. We considered using the ENTER instruction, but decided against it for two reasons:

  • As others have noted, it isn't particularly fast. MOV instructions do just as well.
  • We observed that the display is often sparse, and tends to be denser on the lexically deeper side. Most internal helper functions do fine with access only to their direct lexical parent; you don't always need access to all of your parents. Sometimes none.

Consequently, the compiler figures out exactly which lexical scopes a function needs access to, and generates, in the function prolog where ENTER would go, just the MOV instructions to copy the part of the parent's display that is actually needed. That often turns out to be 1 or 2 pairs of moves.

So we win twice on performance over using ENTER.
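
The idea of copying only the display entries a function actually needs can be sketched in Python as follows. This illustrates the technique described above and is not PARLANSE's actual code generator; `build_sparse_display` and its parameters are hypothetical names chosen for the example.

```python
def build_sparse_display(parent_display, parent_level, parent_frame,
                         needed_levels):
    """Return a callee's display, copying only the entries it uses.

    parent_display: dict {lexical_level: frame_address} held by the parent.
    parent_level:   lexical level of the parent's own frame.
    parent_frame:   address of the parent's own frame.
    needed_levels:  levels the callee's body references, which a compiler
                    would determine statically.
    """
    # Everything the parent can hand down: its display plus its own frame.
    visible = dict(parent_display)
    visible[parent_level] = parent_frame

    missing = set(needed_levels) - set(visible)
    if missing:
        raise KeyError("parent cannot supply levels: %s" % sorted(missing))

    # In generated code, each copied entry costs one load/store MOV pair.
    return {level: visible[level] for level in sorted(needed_levels)}


# A helper at level 3 that only touches its direct parent's variables
# copies a single entry, however deep the nesting is.
display = build_sparse_display({0: 0x9000, 1: 0x8000}, 2, 0x7000, {2})
assert display == {2: 0x7000}
```

By contrast, `enter` with a nesting level of N always writes N display entries, whether or not the function body ever reads them.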

IMHO, ENTER is now one of those legacy CISC instructions which seemed like a good idea at the time it was defined, but gets outperformed by RISC-style instruction sequences that even Intel's x86 processors optimize.

Ira Baxter
  • This is the exact perspective I was hoping for; thank you. I'm still curious as to why AMD decided to keep ENTER in AMD64, even though it seems *no one* uses it. – Jonathon Reinhart Mar 15 '15 at 14:21
  • @JonathonReinhart: Making the decoders reject it in 64-bit mode but accept it in other modes might have *increased* complexity. AMD were very conservative about cleaning up the instruction set, because they weren't sure AMD64 would catch on, and didn't want to be stuck with more transistors that nobody used. We can basically blame capitalism for this huge missed opportunity to tidy up x86 machine code and change things that make a high-performance implementation tricky. (e.g. setcc could have changed to `setcc r/m32`, saving instructions to booleanize into an `int` instead of `char`) – Peter Cordes Sep 28 '17 at 13:08
3

I did some instruction counting statistics on Linux boots using the Simics virtual platform, and found that ENTER was never used. However, there were quite a few LEAVE instructions in the mix. There was almost a 1-1 correlation between CALL and LEAVE. That would seem to corroborate the idea that ENTER is just slow and expensive, while LEAVE is pretty handy. This was measured on a 2.6-series kernel.

The same experiments on a 4.4-series and a 3.14-series kernel showed zero use of either LEAVE or ENTER. Presumably, the newer GCC versions used to compile these kernels have stopped emitting LEAVE (or the machine options are set differently).

jakobengblom2
  • `-fomit-frame-pointer` is the default now. gcc still uses `leave` when it makes frame pointers. (It does so even in optimized code for functions with a VLA: https://godbolt.org/g/LF3Rrk). I tested with a few different `-mtune=` options, and they all used `leave`. clang doesn't use `leave`, though, ever. That's a missed optimization for `-Os` (optimize for size), because it's only 3 uops vs. at least 2 for mov/pop (and maybe a stack-sync uop). – Peter Cordes Sep 28 '17 at 13:21
  • gcc and clang don't use `enter` even if you compile with `-Os` or `-Oz`. `enter n,0` is 12 uops on Skylake, with 1 per 8 clocks throughput. On Ryzen, it's 12 uops with 1 per 16 clocks throughput. At `-Oz`: optimize for size at all costs, it might make sense for clang to use `enter`, because it does stuff like `push 2` / `pop rax` to save 2 bytes vs. `mov eax,2`. (gcc doesn't have a `-Oz` mode.) See http://agner.org/optimize/ for instruction tables and a microarch guide to make sense of them. See also [the SO x86 tag wiki](https://stackoverflow.com/tags/x86/info) – Peter Cordes Sep 28 '17 at 13:23
  • Thanks to @PeterCordes for the information. Fits what I see. – jakobengblom2 Oct 26 '17 at 18:42
  • This does not answer the question. The question was not what Linux uses or what GCC emits, but whether languages exist that _do_ use the instruction with a non-zero nesting level. – JdeBP Jun 22 '18 at 13:22
3

IMP77 (developed at Edinburgh University) allows nested routines/functions. The Intel version of the compiler uses the ENTER instruction sometimes with a non-zero level value.

  • Interesting! Can you provide a reference? And it would be cool if you could provide a code snippet with the generated assembly. – Jonathon Reinhart Sep 28 '22 at 13:26
  • Search for IMP77 on github (under siliconsam). Then search for ENTER in the source code for pass2.imp and pass3coff.c Source file pass2.imp indicates the code generation and pass3coff.c adds the actual ENTER instruction into the COFF object file. Example IMP code and generated Intel code to follow – John McMullin Sep 29 '22 at 15:45
2

The Vector Pascal compiler uses this instruction for procedure entry. Pascal allows arbitrary levels of nesting and the display supported by Enter is useful for this.

1

I use it in Pascal compilers. Although it is said to be slower than the equivalent code, it is more compact. The nesting limit of 31 is not a big deal, but the 64 KB limit that enter places on locals can be a problem. The solution I use is to emit an enter with 0 locals, then allocate the locals after the enter instruction. This only needs to be done if the locals exceed 64 KB.
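
A code generator applying this workaround might look like the sketch below. `emit_prologue` is a hypothetical helper written for illustration (it is not taken from any actual Pascal compiler), and it emits 32-bit Intel-syntax text.

```python
ENTER_MAX_LOCALS = 0xFFFF    # ENTER's frame-size operand is a 16-bit immediate


def emit_prologue(locals_size, nesting_level):
    """Return the assembly lines for a procedure prologue built on ENTER."""
    if not 0 <= nesting_level <= 31:
        raise ValueError("ENTER nesting level must be in 0..31")
    if locals_size < 0:
        raise ValueError("locals_size must be non-negative")

    if locals_size <= ENTER_MAX_LOCALS:
        # Common case: a single instruction builds the frame and the display.
        return ["enter %d, %d" % (locals_size, nesting_level)]

    # Locals do not fit in 16 bits: let ENTER set up the frame and display
    # with no locals, then reserve the locals with an ordinary SUB.
    return ["enter 0, %d" % nesting_level,
            "sub esp, %d" % locals_size]


for line in emit_prologue(70000, 2):
    print(line)
```

Because the display entries are pushed before the `sub`, locals allocated this way sit below the display, just as they do when `enter` reserves them itself.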

There are several optimizations that can eliminate use of enter, and even of framing in general. For example, a function at nesting level zero need not use enter. Also, you can access locals via esp offsets, so you don't need the full ebp exchange.

BTW, I believe that the enter, leave and bound instructions were added to the x86 instruction set (they first appeared in the 80186) specifically for Pascal. The reason being that at the time of introduction, Pascal was at the height of its popularity, and had a need for all of those instructions.

The reason why enter is slower is because it is (literally) a "slow path" instruction. When the superscalar modes for the Pentium were being designed, the "deprecated" CISC instructions like enter, leave, and string compare and move and similar instructions were not translated as ROPs but sidelined for a microcoded engine to be sequenced. Most of the instructions get transcoded into internal ROPs or RISC operations, which are basically long word microcode instructions that perform the equivalent operation within the CPU.

That sounds counter-intuitive, one microcode to go slower than another, but the internal microcode for a CPU can be designed with very long control words for single, or very few cycle operations, but be shorter with lots of loops in them. Also, there is a difference between executing with microcode and translating TO microcode.

Scott Franco
0

Corresponding Intel instructions showing IMP source and generated machine code

! main routine/program
%begin
 0000 C8 00 00 01                           ENTER 0000,1

! global variable
%integer sum

! nested routine
%integer %function mcode001( %integer number, x )
 0004 EB 00                                 JMP L1001
 0006                      L1002  EQU $
 0006 C8 00 00 02                           ENTER 0000,2
    ! local variable
    %integer r

    r = number + x
 000A 8B 45 0C                              MOV EAX,[EBP+12]
 000D 03 45 08                              ADD EAX,[EBP+8]
 0010 89 45 F4                              MOV [EBP-12],EAX

    %result = r
 0013 8B 45 F4                              MOV EAX,[EBP-12]
 0016 C9                                    LEAVE
 0017 C3                                    RET
%end
 0018                      L1001  EQU $

! call the nested routine
sum = mcode001(46,24)&255
 0018 6A 2E                                 PUSH 46
 001A 6A 18                                 PUSH 24
 001C E8 00 00                              CALL 'MCODE001' (INTERNAL L1002 )
 001F 83 C4 08                              ADD ESP,8
 0022 25 FF 00 00 00                        AND EAX,255
 0027 89 45 F8                              MOV [EBP-8],EAX

! show the result itos converts binary integer to text
printstring("Result =".itos(sum,3)); newline
 002A FF 75 F8                              PUSH WORD PTR [EBP-8]
 002D 6A 03                                 PUSH 3
 002F 8D 85 F8 FE FF FF                     LEA EAX,[EBP-264]
 0035 50                                    PUSH EAX
 0036 E8 42 00                              CALL 'ITOS' (EXTERN 66)
 0039 83 C4 0C                              ADD ESP,12
 003C 8D 85 F8 FD FF FF                     LEA EAX,[EBP-520]
 0042 50                                    PUSH EAX
 0043 B8 00 00 00 00                        MOV EAX,COT+0
 0048 50                                    PUSH EAX
 0049 68 FF 00 00 00                        PUSH 255
 004E E8 03 00                              CALL '_IMPSTRCPY' (EXTERN 3)
 0051 83 C4 0C                              ADD ESP,12
 0054 8D 85 F8 FD FF FF                     LEA EAX,[EBP-520]
 005A 50                                    PUSH EAX
 005B 8D 85 F8 FE FF FF                     LEA EAX,[EBP-264]
 0061 50                                    PUSH EAX
 0062 68 FF 00 00 00                        PUSH 255
 0067 E8 05 00                              CALL '_IMPSTRCAT' (EXTERN 5)
 006A 83 C4 0C                              ADD ESP,12
 006D 81 EC 00 01 00 00                     SUB ESP,256
 0073 89 E0                                 MOV EAX,ESP
 0075 50                                    PUSH EAX
 0076 8D 85 F8 FD FF FF                     LEA EAX,[EBP-520]
 007C 50                                    PUSH EAX
 007D 68 FF 00 00 00                        PUSH 255
 0082 E8 03 00                              CALL '_IMPSTRCPY' (EXTERN 3)
 0085 83 C4 0C                              ADD ESP,12
 0088 E8 34 00                              CALL 'PRINTSTRING' (EXTERN 52)
 008B 81 C4 00 01 00 00                     ADD ESP,256
 0091 E8 3C 00                              CALL 'NEWLINE' (EXTERN 60)

%endofprogram
 0094 C9                                    LEAVE
 0095 C3                                    RET
      _TEXT  ENDS
      CONST  SEGMENT WORD PUBLIC 'CONST'
 0000                                       db 08,52 ; .R
 0002                                       db 65,73 ; es
 0004                                       db 75,6C ; ul
 0006                                       db 74,20 ; t.
 0008                                       db 3D,00 ; =.
      CONST  ENDS
      _TEXT  SEGMENT WORD PUBLIC 'CODE'
             ENDS
      DATA  SEGMENT WORD PUBLIC 'DATA'
      DATA    ENDS
              ENDS
      _SWTAB  SEGMENT WORD PUBLIC '_SWTAB'
      _SWTAB   ENDS
  • Hey John, thanks for the update. Rather than adding a new answer, please [edit your original answer](https://stackoverflow.com/posts/73873805/edit) and add this content there. Then delete these other two answers. Thanks! – Jonathon Reinhart Sep 29 '22 at 18:01
-1
! main routine/program
%begin

! global variable
%integer sum

! nested routine
%integer %function mcode001( %integer number, x )
    ! local variable
    %integer r

    r = number + x

    %result = r
%end

! call the nested routine
sum = mcode001(46,24)&255

! show the result itos converts binary integer to text
printstring("Result =".itos(sum,3)); newline

%endofprogram
  • Hey John, thanks for the update. Rather than adding a new answer, please [edit your original answer](https://stackoverflow.com/posts/73873805/edit) and add this content there. Then delete these other two answers. Thanks! – Jonathon Reinhart Sep 29 '22 at 18:02