35

As was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).

However, I now read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) or optimizing fully with frame pointers (using /Ox /Oy-) doesn't really make a difference in performance.

Microsoft seems to indicate that adding frame pointers (/Oy-) makes debugging easier, but is this really the case?

I did some experiments and noticed that:

  • in a simple 32-bit test executable (compiled using /Ox /Ob0) the omission of frame pointers does increase performance (by about 10%). But this test executable only performs some function calls, nothing else.
  • in my own application the adding/removing of frame pointers doesn't seem to have a big effect. Adding frame pointers seems to make the application about 5% faster, but that could be within the margin of error.

What is the general advice regarding frame pointers?

  • should they be omitted (/Ox) in a release executable because omitting them really has a positive effect on performance?
  • should they be added (/Ox /Oy-) in a release executable because they improve debuggability (when debugging with a crash-dump file)?

Using Visual Studio 2010.

Peter Cordes
Patrick
  • Function calls are about the only thing it optimises, and only by a few instructions per call. It does reduce the stack space needed (which doesn't matter unless you're doing deep recursion) – John Dvorak Oct 22 '12 at 07:06
  • 1
    I always felt frame pointers are kinda redundant if you know where your stack pointer is (which, for the compiler, is easy). – John Dvorak Oct 22 '12 at 07:11
  • 2
    I assume the "10%" speed benefit was with 32-bit x86, which only has 7 general-purpose registers (not including the stack pointer), so spending 1 of them as a frame pointer is a big deal. – Peter Cordes Oct 11 '20 at 11:57
  • @PeterCordes You're right. The last few years I haven't thought about frame pointers anymore, since we're all 64-bit now (and the mentioned compiler option doesn't exist in 64-bit Visual Studio). The number of registers could indeed play an important role in this. – Patrick Oct 12 '20 at 06:11
  • 4
    GCC supports `-fno-omit-frame-pointer` in any mode (and for non-x86 targets). It's a fairly well known fact that the incremental value of 1 more usable register increases the fewer total you have, from CS papers that have explored this by e.g. compiling SpecInt for various simulated machines and looking at code size and dynamic instruction count. However, on x86 (including x86-64), frame pointers can sometimes save code-size in code that references stack vars often; `[rbp+-disp8]` has a smaller encoding than `[rsp +- disp8]`. So it can sometimes actually hurt instead of help to omit FPs. – Peter Cordes Oct 12 '20 at 06:28

1 Answer

49

Phoronix tested the performance downside of -O2 -fno-omit-frame-pointer with x86-64 GCC 12.1 on a Zen 3 laptop CPU for multiple open-source programs, as proposed for Fedora 37. Most of them had performance regressions, a few of them very serious, although the biggest ones are probably some kind of fluke or other interaction. Geometric mean slowdown of 14% (including those possible outliers).


Short answer: By omitting the frame pointer,

You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.
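For example, a hypothetical 32-bit snippet (not taken from any real program) showing why this is more awkward by hand: an EBP-relative offset to a local stays fixed, while the ESP-relative offset changes every time something is pushed:

; with a frame pointer: the same local is always at [EBP-4]
MOV EAX,[EBP-4]  ; read the local
PUSH ECX         ; push an argument for an upcoming call
MOV EAX,[EBP-4]  ; the same offset still refers to the same local

; without a frame pointer: offsets follow the stack pointer
MOV EAX,[ESP+8]  ; read the local
PUSH ECX         ; push an argument; ESP moves down by 4
MOV EAX,[ESP+12] ; the same local is now 4 bytes further from ESP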

You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't a win.

You save a memory write to cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.

You free up a general-purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But if most of the CPU time is spent talking to the main memory (or even the hard drive), omitting the frame pointer is not going to save you from that.

The debugger will lose an easy way to generate the stack trace. The debugger might still be able to generate the stack trace from a different source (such as a PDB file).


Long answer:

The typical function entry and exit is (16-bit processor):

PUSH BP   ;push the base pointer (frame pointer)
MOV BP,SP ;store the stack pointer in the frame pointer
SUB SP,xx ;allocate space for local variables et al.
...
LEAVE     ;restore the stack pointer and pop the old frame pointer
RET       ;return from the function

An entry and exit without a frame pointer could look like (32-bit processor):

SUB ESP,xx ;allocate space for local variables et al.
...
ADD ESP,xx ;de-allocate space for local variables et al.
RET        ;return from the function.

You save two instructions, but you also duplicate a literal value, so the code doesn't get shorter (quite the opposite, especially with [esp+xx] addressing modes taking an extra byte vs. [ebp+xx]). You might save a few clock cycles (or not, if it causes a miss in the instruction cache), and you did save some space on the stack.
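To make the code-size point concrete, here is a hedged illustration with the 32-bit instruction encodings (byte counts I believe to be correct; the registers and offsets are arbitrary):

MOV EAX,[EBP-8] ; 8B 45 F8    - 3 bytes (ModRM byte only)
MOV EAX,[ESP+4] ; 8B 44 24 04 - 4 bytes (an ESP base always needs an extra SIB byte)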


You do free up a general-purpose register. This has only benefits.

In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including the this pointer), the seventh argument still fits into a register. (Although most calling conventions don't pass that many in registers, e.g., two for MS fastcall, three for GCC regparm(3) on 32-bit x86. Up to six integer register arguments on x86-64 System V, or 4 register arguments on most RISC processors.)
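As a hedged 32-bit illustration (MS fastcall; the function is made up for the example): the first two integer arguments arrive in registers and the rest spill to the stack, so every additional usable register is one more argument or temporary that can stay out of memory:

; int __fastcall add3(int a, int b, int c)
add3:
    MOV EAX,ECX      ; a arrives in ECX
    ADD EAX,EDX      ; b arrives in EDX
    ADD EAX,[ESP+4]  ; c didn't fit in a register; it sits just above the return address
    RET 4            ; fastcall: the callee pops its stack arguments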

The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means lower instruction cache footprint, which means reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyway).

The 16-bit x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for an array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.
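A brief 16-bit illustration of those addressing modes (the register roles are chosen just for the example):

MOV AX,[BP+SI]   ; BP and SI combined: e.g. SI holds the array base, BP an offset into it
MOV AX,[BX+DI+4] ; BX as base, DI as index, plus a constant displacement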

Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just fine in the L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of move optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as enormous. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).

A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyway before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.
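A minimal sketch of that cost (32-bit; some_func is hypothetical and takes no stack arguments): a value kept in a call-clobbered register must be spilled and reloaded around every call, so the extra register buys little in call-heavy code:

MOV ECX,[ESP+8]  ; keep a value in ECX, a call-clobbered register
PUSH ECX         ; spill it: the callee is allowed to overwrite ECX
CALL some_func   ; hypothetical callee (no stack arguments)
POP ECX          ; reload the value after the call
ADD EAX,ECX      ; continue using it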

See "What are callee and caller saved registers?" for more about how calling conventions are designed with a mix of call-clobbered and call-preserved registers to give compilers a good mix of each, so functions have some scratch registers for temporaries that don't need to live across function calls, but also some registers that callees will preserve. Also see "Why make some registers caller-saved and others callee-saved? Why not make the caller save everything it wants saved?"

Also note that x86 has a separate register space for floating-point registers, meaning that floats cannot utilise the BP register without extra data-movement instructions anyway. Only integers and memory pointers can.


You do lose debuggability by omitting frame pointers. This answer shows why:

If the code crashes, all the debugger needs to do to generate the stack trace is:

    PUSH BP      ; log the current frame pointer as well
$1: CALL log_BP  ; log the frame pointer currently on stack
    LEAVE        ; pop the frame pointer to get the next one
    CMP [BP+4],0 ; check the return address saved just above the old frame pointer
    JNZ $1       ; loop until the stack cannot be popped further (the return address is some specific value)

If the code crashes without a frame pointer, the debugger might have no way to generate the stack trace, because it might not know how much was subtracted from the stack pointer on entry (to find out, it would need to locate the function's entry/exit code). If the debugger doesn't know that the frame pointer is not being used, it might even crash itself.

Modern debug-info formats have metadata that still allows stack backtraces in optimized code where the compiler defaults to not using [E/R]BP as a frame pointer. Compilers know how to use assembler directives to create this extra metadata, or write it directly in the object file, not in the parts that normally get mapped into memory. If you don't do this for hand-written assembly, then debugability would suffer, especially for crashes in functions called by a hand-written assembly function.
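For example, with the GNU assembler that metadata looks roughly like this (a hedged sketch using CFI directives in AT&T syntax, not MSVC syntax; the directives emit unwind info into a separate section rather than into the executable code):

func:
    .cfi_startproc             # start unwind metadata for this function
    subl  $16, %esp            # allocate locals with no frame pointer
    .cfi_adjust_cfa_offset 16  # tell unwinders that ESP moved down by 16
    # ...function body...
    addl  $16, %esp            # release the locals
    .cfi_adjust_cfa_offset -16
    ret
    .cfi_endproc               # end unwind metadata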

Peter Cordes
John Dvorak
  • When we debug using minidumps, the debugger can still use the PDB file to find out how to walk the stack. Isn't that sufficient? Does using frame pointers help in those rare cases where the stack is empty (because there is a call or return to a location outside executable code)? – Patrick Oct 22 '12 at 09:53
  • The [wikipedia article about PDB](http://en.wikipedia.org/wiki/Program_database) links to [an article on MSDN](http://msdn.microsoft.com/en-us/library/ff558825(v=vs.85).aspx) which states that "symbol files might contain ... frame pointer omission (FPO) records". This indicates that the PDB file is sufficient to generate a stack trace. However, this might not be sufficient when the code jumps to a random location (calling a non-function). Good point - didn't know about PDB files. – John Dvorak Oct 22 '12 at 12:46
  • You may have PDBs for your *own* code but not for 3rd party code. Also I think the other win with FPO was freeing up a register for other use. – Mark Sowul Dec 08 '13 at 01:51
  • @MarkSowul I don't think this is a win. If the 3rd party code omits frame pointers _and_ lacks FPO records, when your app crashes, all you know is that it crashed in a 3rd party piece of code. But you don't know whose code (if multiple 3rd parties commit the same sin) and you don't know why - was it because of a bug in that code, or because you were sending it bogus pointers ("store my data into this (read-only section / place on the stack that used to belong to me\* / object I've recently freed), please")? – John Dvorak Dec 08 '13 at 07:11
  • \*of course, if you corrupt your stack, you aren't going to get a useful stacktrace in _any_ case, framepointers or not – John Dvorak Dec 08 '13 at 07:13
  • Yeah, to be more specific, if you use it, it really sucks for anyone who doesn't have your symbols (and I'm not sure whether public symbols will be enough?) However if the code doesn't crash, freeing up a register is good for performance. It's perhaps worth noting that MS no longer feels the performance is worth the impaired debuggability: http://blogs.msdn.com/b/larryosterman/archive/2007/03/12/fpo.aspx – Mark Sowul Dec 28 '13 at 21:25
  • 6
    To be honest, I expected the "short answer" to be much shorter, like "yes" or "no". – VP. Jan 30 '20 at 11:31
  • @PeterMortensen - "Geometric mean" is a single average, not plural. Most of the rest of your edit looks ok, although expanded wording on "register args on most RISCs" makes that parenthetical aside take up more space and attention in the answer than it should. Especially making RISC a link draws more attention to it than it deserves in a question about x86. It's not even a useful link, like to something about calling conventions on typical RISCs, it's just a definition of the term anyone could google from the word "RISC". But thanks for your editing efforts overall. – Peter Cordes Sep 01 '22 at 17:44
  • The CPU used in the Phoronix test mentioned at the beginning of this answer is a Ryzen 5 5500U, which is not a Zen 3 CPU. You even included the link to the Wikipedia "Zen 3" page, but apparently you did not read that page, since it does not list the 5500U as a Zen 3 CPU. – RabidBear Oct 27 '22 at 19:30