31

I'm writing a JIT compiler with an x86 backend and learning x86 assembler and machine code as I go. I used ARM assembler about 20 years ago and am surprised by the difference in cost models between these architectures.

Specifically, memory accesses and branches are expensive on ARM but the equivalent stack operations and jumps are cheap on x86. I believe modern x86 CPUs do far more dynamic optimizations than ARM cores do and I find it difficult to anticipate their effects.

What is a good cost model to bear in mind when writing x86 assembler? Which combinations of instructions are cheap and which are expensive?

For example, my compiler would be simpler if it always generated the long form for loading integers or jumping to offsets, even if the integers were small or the offsets close, but would this impact performance?
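For concreteness, here is a minimal NASM sketch (mine, added for illustration) of the short vs. long forms in question; byte counts follow the Intel instruction-set reference, and the labels plus the Linux exit sequence are only there so the file assembles and runs:

```
; encodings.asm -- nasm -f elf64 encodings.asm && ld -o encodings encodings.o
; Note: each pair below decodes to the same single uop; the difference is purely
; code size (I-cache footprint / decode bandwidth).
section .text
global _start
_start:
    mov  eax, 5                   ; B8 imm32           -> 5 bytes, zero-extends into RAX
    mov  rax, 5                   ; REX.W C7 /0 imm32  -> 7 bytes (NASM's default pick), sign-extended
    mov  rax, 0x123456789abcdef0  ; REX.W B8 imm64     -> 10 bytes, the full "long form"

    jmp  short .next              ; EB rel8            -> 2 bytes
.next:
    jmp  strict near .done        ; E9 rel32           -> 5 bytes ("strict" stops NASM shrinking it),
                                  ; the long form a simple JIT might always emit
.done:
    mov  eax, 60                  ; exit(0) via the Linux syscall ABI
    xor  edi, edi
    syscall
```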

I haven't done any floating point yet, but I'd like to get on to it soon. Is there anything non-obvious about the interaction between normal (integer) code and floating-point code?

I know there are lots of references (e.g. Michael Abrash) on x86 optimization but I have a hunch that anything more than a few years old will not apply to modern x86 CPUs because they have changed so much lately. Am I correct?

Peter Cordes
J D
  • 1
    Which x86 implementation are you interested in? – harold Mar 31 '12 at 16:52
  • @harold Anything you'd find in a laptop, desktop or server today. So I think SSE3 is a given. I'd like generic advice about optimizing for all of them as well as specifics about any surprises I might find, e.g. an instruction that is 10x slower on the Atom. – J D Mar 31 '12 at 18:01
  • 1
    Conroe and its derivatives (Nehalem, Sandy Bridge) are as different from Atom as they are different from ARM. The principles of optimizing for them are the same as for the P6, so some older texts are valid. – Z.T. Apr 02 '12 at 08:29
  • See also several performance-related links in the [x86 tag wiki](http://stackoverflow.com/tags/x86/info). – Peter Cordes Aug 08 '16 at 00:53
  • 2
    See [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for more about static performance analysis on modern x86. – Peter Cordes Sep 24 '18 at 07:13
  • Also [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) for a more beginner-friendly discussion of what is effectively a cost model for modern x86 CPUs (and other superscalar / OoO exec CPUs). – Peter Cordes Apr 26 '22 at 07:34

5 Answers

38

The best reference is the Intel Optimization Manual, which provides fairly detailed information on architectural hazards and instruction latencies for all recent Intel cores, as well as a good number of optimization examples.

Another excellent reference is Agner Fog's optimization resources, which have the virtue of also covering AMD cores.

Note that specific cost models are, by nature, micro-architecture specific. There's no such thing as an "x86 cost model" that has any kind of real validity. At the instruction level, the performance characteristics of Atom are wildly different from i7.

I would also note that memory accesses and branches are not actually "cheap" on x86 cores -- it's just that the out-of-order execution model has become so sophisticated that it can successfully hide the cost of them in many simple scenarios.
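As a sketch of what that hiding looks like (my example, not the answer's): the same load instruction is almost free when nothing has to wait for its result, and costs its full latency when everything does:

```
; Independent loads: the out-of-order core overlaps them with each other and with the
; adds, so this runs at roughly one element per cycle even though every iteration
; touches memory.
sum_array:                      ; rdi = pointer to qword array, rsi = element count (> 0)
    xor  eax, eax
.loop:
    add  rax, [rdi]             ; each load is independent; its ~4-5 cycle L1 latency is hidden
    add  rdi, 8
    dec  rsi
    jnz  .loop
    ret

; Dependent loads (pointer chasing): each address comes from the previous load, so there
; is nothing to overlap and every iteration pays the full load-use latency (or a miss).
walk_list:                      ; rdi = head of a NULL-terminated list; the next pointer is
.loop:                          ; assumed to be the first field of each node
    mov  rdi, [rdi]             ; next = node->next; nothing downstream can start until it completes
    test rdi, rdi
    jnz  .loop
    ret
```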

Stephen Canon
  • 1
    Thanks! "the performance characteristics of Atom are wildly different from i7". Can you cite something with more information about this? – J D Mar 31 '12 at 16:06
  • @JonHarrop more information than in Agner Fog's Microarchitectures document? I would be surprised if more information has been made public at all – harold Mar 31 '12 at 16:52
  • 5
    @JonHarrop: A modern i7 core is out-of-order and can sustain retiring 4 instructions per cycle. An Atom core is strictly in-order and can retire 2 instructions per cycle in ideal circumstances, but use of some instructions restricts it to only 1 ipc. This is all detailed in both Intel's document and in Agner's notes. From a very high-level architectural perspective, Atom is more similar to, say, an ARM Cortex-A8 than to other modern x86 cores. – Stephen Canon Mar 31 '12 at 17:09
  • 1
    +1 for Agner Fog. I prefer his optimization manuals to Intel's ;-) – Gunther Piez Mar 31 '12 at 21:02
  • 1
    @drhirsch: they both have their merits. In my experience, Intel's are more likely to have *omissions*, whereas Agner's are more likely to have *errors* (Agner is quite good about correcting errors, to his credit). – Stephen Canon Mar 31 '12 at 21:46
5

Torbjörn Granlund's Instruction latencies and throughput for AMD and Intel x86 processors is good too.

Edit

Granlund's document concerns instruction throughput in the context of how many instructions of a certain type can be issued per clock cycle (i.e. performed in parallel). He also claims that Intel's documentation isn't always accurate.
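To make the latency vs. throughput distinction concrete, here is a small sketch (mine, not from Granlund's document; the cycle counts are typical of recent Intel cores, so check the tables for your actual target):

```
; Latency-bound: each imul consumes the previous result, so the chain runs at the
; instruction's latency, roughly 3 cycles per imul, no matter how wide the core is.
latency_bound:
    imul rax, rax
    imul rax, rax
    imul rax, rax
    imul rax, rax
    ret

; Throughput-bound: the four imuls are independent, so the core can start a new one
; every cycle, roughly 1 cycle per imul, despite being the same instruction.
throughput_bound:
    imul rax, r8
    imul rbx, r8
    imul rcx, r8
    imul rdx, r8
    ret
```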

Olof Forshell
2

Of course, Agner Fog's reports and the Intel® 64 and IA-32 Architectures Optimization Reference Manual are both necessary and excellent references. AMD also has an optimization manual:

  • Software Optimization Guide for AMD Family 15h Processors

However, two Intel tools are essential in understanding code sequences:

  • Intel® Architecture Code Analyzer
  • Intel® VTune™

IACA is your cost model. I use it on OSX but VTune only runs on Windows and Linux.
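As a concrete sketch (mine; the marker byte sequences are the ones quoted in the comments below), the code to be analysed is bracketed with IACA's start/end markers, assembled, and the object file is passed to the iaca tool with a command line like the one shown in the comments:

```
; Minimal IACA harness: only the instructions between the markers are analysed.
%macro IACA_START 0
    mov ebx, 111
    db  0x64, 0x67, 0x90
%endmacro

%macro IACA_END 0
    mov ebx, 222
    db  0x64, 0x67, 0x90
%endmacro

section .text
my_kernel:                      ; hypothetical loop whose port pressure / latency you want IACA to report
    IACA_START
.loop:
    add  rax, [rdi]
    add  rdi, 8
    dec  rsi
    jnz  .loop
    IACA_END
    ret
```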

You can also dig into the Intel patent literature and various Intel papers to better understand how things work:

  • The Next-Generation Intel Core Microarchitecture
  • Haswell: The Fourth-Generation Intel Core Processor
  • Micro-operation cache: A power aware frontend for variable instruction length ISA
Olsonist
  • IACA and VTune are tools you could use while *tuning* a cost model, but actually fork/exec'ing IACA to test a sequence of instructions seems like it would be too slow for an optimizing compiler to do on every basic block, unless it reserved that for hot loops. VTune is mostly a tool for reading perf counters, which means you have to actually execute the instructions you're producing. That only works with `-mtune=native`; tuning for the host that's doing the compiling. – Peter Cordes Aug 08 '16 at 00:52
  • IACA is a static analysis tool. Your code doesn't even run. You wrap code with a prefix and a suffix and run the tool: `iaca -64 -arch HSW -ignore true -analysis LATENCY prog >lst`. The markers are `%macro START_MARKER 0 / mov ebx, 111 / db 0x64, 0x67, 0x90 / %endmacro` and `%macro END_MARKER 0 / mov ebx, 222 / db 0x64, 0x67, 0x90 / %endmacro`.
  • I've used IACA before, even posted SO answers including IACA output :P. My point was that using it *as* your cost model would mean the compiler actually invokes IACA on all the different possible implementations for a loop. Since IACA is closed source and only distributed as an executable, not a library, you would need to write an object file and fork/exec IACA. (Yes, I realize this is not what you meant, and that this is ridiculous. Just taking your wording literally :P) Upvoted for being useful for tuning in general, or for *tuning* a compiler's cost model. – Peter Cordes Aug 08 '16 at 01:20
  • I think the question wasn't looking for something which the JIT uses during its compilation but rather something the JIT engineer uses during development. IACA generates a report. – Olsonist Aug 08 '16 at 01:33
  • Yes, but how can you "use it as your model"? It reports latency/throughput for specific sequences of code. You could reverse engineer it to extract the logic it uses to figure those things out, but I wouldn't call that literally using it as your model. So I think we agree, that the only sane way to use it is *tuning* your compiler's cost model until it accurately predicts things the same as IACA. – Peter Cordes Aug 08 '16 at 01:36
  • It would be easier to read Agner Fog's microarch guide. Other than a few disagreements between Agner's insn tables and IACA's uop counts, I generally get the same results from static analysis by hand or with IACA. So if you want to implement your own code that evaluates the cost of a sequence, starting from Agner's guides would make more sense. IACA has very limited (or no) support for taken / not-taken branches inside inner loops, so it's not really directly usable for evaluating all loops anyway. I was just taking your "use it as your cost model" phrasing to the logical extreme. – Peter Cordes Aug 08 '16 at 01:39
1

It's worth looking at the backends of existing open-source compilers such as GCC and LLVM. These have models for instruction costs and also decent (but idealized) machine models (e.g., issue width, cache sizes, etc.).

BeeOnRope
0

I'm writing a JIT compiler with an x86 backend and learning x86 assembler and machine code as I go.

The essential problem here is that a JIT compiler can't afford to spend a huge amount of time micro-optimising. Because "optimising" happens at run-time, the cost of doing optimisations needs to be less than the time saved by the optimisations (otherwise the optimisation becomes a net loss in performance).

For 80x86 there are multiple different CPUs with different behaviour/characteristics. If you take the actual CPU's specific characteristics into account, then the cost of doing the optimisation increases and you slam directly into a "costs more than you gain" barrier. This is especially true for things like "ideal instruction scheduling".

Fortunately, most (but not all) modern 80x86 CPUs have various features (out-of-order, speculative execution, hyper-threading) to mitigate (some of) the performance costs caused by "less than perfect" optimisation. This tends to make expensive optimisations less beneficial.

The first thing you're going to want to do is identify which pieces of code should be optimised and which pieces shouldn't. Things that aren't executed frequently (e.g. "only executed once" initialisation code) should not be optimised at all; it's only frequently executed pieces (e.g. inner loops, etc.) where it's worth bothering. Once you've identified a piece that's worth optimising, the question then becomes "how much?".
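A common way a baseline JIT finds those frequently executed pieces is to emit a cheap execution counter and hand the code off to an optimising tier once it trips a threshold; a sketch of the kind of stub a JIT might emit (the names, the threshold and the runtime hook are invented for illustration):

```
; Hypothetical baseline stub: count executions, recompile when hot.
extern request_recompile                     ; made-up runtime hook that queues this function
                                             ; for the optimising tier
section .text
func_entry:
    inc  dword [rel hotness_counter]         ; roughly one extra uop of profiling overhead per call
    cmp  dword [rel hotness_counter], 10000  ; arbitrary, tunable hotness threshold
    jb   .baseline_body
    call request_recompile
.baseline_body:
    ; ... quickly generated, unoptimised baseline code ...
    ret

section .data
hotness_counter: dd 0
```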

As a crude over-generalisation; I'd expect that (on average) 90% of the code isn't worth optimising at all, and for 9% of the code it's only worth doing some generic optimisation. The remaining 1% (which could benefit from extensive optimisation in theory) will end up being too much hassle for the JIT compiler developer to bother with in practice (and would result in a massive complexity/verifiability nightmare - e.g. "bugs that only exist when running on some CPUs" scenarios).

Brendan