
In my grad Computer Architecture course, the professor talked about pipelining in MIPS, but also said that because of something about the x86 instruction set (which I can't quite remember), x86 processors need added logic to pre-process the assembly instructions before they can be pipelined.

I am not looking for a direct numeric answer, but rather for documentation or hints on the topic: what is done to translate the x86 instructions to allow pipelining, how does this logic work, etc.

Thanks a bunch.

erz11
  • The term is micro-ops or micro-operations; see https://en.wikipedia.org/wiki/Micro-operation. Also, anything by Agner Fog will be good, so see: http://www.agner.org/optimize/ – Craig Estey Sep 29 '16 at 23:24

2 Answers


Many forum threads over at http://realworldtech.com/ have debated how much the "x86 tax" costs x86 CPUs in terms of transistor count / performance / power, vs. a simple-to-decode ISA like MIPS.

10% is a number that has been thrown around as a wild guess. Some of that cost is fixed and doesn't scale as you make the CPU more powerful: e.g. it takes maybe 3 extra pipeline stages to decode x86 instructions into a stream of uops that are similar in complexity to separate MIPS instructions. An ADD with a memory destination might decode into a load, an ADD, and a store. (Micro-fusion in some parts of the pipeline makes it more complicated than that.)
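To make that concrete, here's a sketch of how such an instruction might split up. The uop names and syntax are purely illustrative, not Intel's or AMD's actual internal representation:

    add dword [rdi], eax    ; one x86 instruction: memory += register
    ; might decode into three RISC-like uops, roughly:
    ;   load   tmp    <- [rdi]
    ;   add    tmp    <- tmp + eax
    ;   store  [rdi]  <- tmp

Each of those uops looks a lot like a single MIPS instruction, which is the point: the rest of the pipeline only ever sees simple, fixed-complexity operations.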

Decoding variable-length x86 instructions is very power-intensive to do in parallel (up to 4 per clock in current designs). x86 isn't just variable-length: even determining the length (i.e. the start of the next instruction) requires looking at a lot of bits, because there are optional prefixes and various other complexities. Agner Fog's blog post about the "instruction set war" between Intel and AMD discusses some of the costs of the messy state of x86's opcode coding space. (See also his microarch pdf to learn about the pipelines in modern x86 designs from AMD and Intel; it's aimed at finding bottlenecks in real code / understanding performance counters, but it's also interesting if you're just curious how CPUs work.)
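To get a feel for why finding instruction boundaries is hard, compare a few closely related encodings (NASM syntax; these are the standard x86 encodings of these instructions):

    83 C0 01             add eax, 1      ; 3 bytes: opcode 83, ModRM, imm8
    66 83 C0 01          add ax, 1       ; 4 bytes: same bytes plus a 66 operand-size prefix
    81 C0 00 01 00 00    add eax, 0x100  ; 6 bytes: same opcode family, imm32 form

The decoder can't know where instruction N+1 starts until it has looked at the prefixes, opcode, ModRM byte, and immediate size of instruction N, and it has to do that for several instructions in the same cycle.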

The cost of decoding x86 instructions is so high that Intel's Sandybridge microarchitecture family uses a small/fast decoded-uop cache as well as a traditional L1 I-cache. Even large loops usually fit in the uop cache, saving power and increasing front-end throughput vs. running from the legacy decoders. Most other ISAs can't get nearly as much benefit from a decoded-instruction cache, so they don't use them. (Intel previously experimented with a decoded-uop trace cache in Pentium 4, without an L1 I-cache and with weaker decoders, but SnB's uop cache is not a trace cache and the legacy decoders are still fast enough.)

OTOH, some of x86's legacy baggage (like partial FLAGS updates) imposes a cost on the rest of the pipeline and out-of-order core. Modern x86 CPUs have to rename different parts of FLAGS separately, to avoid false dependencies in sequences like DEC / JNZ (where DEC doesn't modify CF). Intel experimented with not doing this (in Pentium 4, aka the NetBurst microarchitecture family). They thought they could force everyone to recompile their code with compilers that avoided INC/DEC and used add eax, 1 instead (which does modify all the flags). (This optimization advice stuck around for ages in their official optimization manual, long after P4 was obsolete, and many people think it's still relevant.)
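A minimal sketch of the problem (register choices and the loop_top label are arbitrary):

    dec ecx          ; writes ZF/SF/OF/AF/PF, but leaves CF unchanged
    jnz loop_top     ; reads only ZF, so this pair is cheap if ZF is renamed alone
    adc eax, ebx     ; reads CF, which dec did NOT write: CF has to come from
                     ; some older instruction, a false dependency / merging problem
                     ; unless CF is renamed separately from the rest of FLAGS

    inc eax          ; like dec: writes every flag except CF
    add eax, 1       ; writes all the flags, including CF: the P4-era recommendation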


Some people argue that x86's strong memory-ordering semantics should be considered part of the "x86 tax" that reduces the parallelism a pipelined CPU can exploit, but others (e.g. Linus Torvalds) would argue that having the hardware do the ordering for you means you don't need barrier instructions all over the place in multi-threaded code, and that making the barrier instructions "cheap" (not a full flush of the store buffer or whatever) requires the hardware to track memory ordering in enough detail that it might as well just make the barriers implicit.
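The standard litmus test makes the trade-off visible. With memory locations x and y both initially 0:

    ; Thread 1:                ; Thread 2:
    mov dword [x], 1           mov dword [y], 1
    mov eax, [y]               mov ebx, [x]

On x86 the only reordering the hardware may perform is StoreLoad, so eax == 0 and ebx == 0 can still both happen here; putting mfence (or any locked instruction) between the store and the load in each thread rules that out. Every other combination (load/load, store/store, load/store) is already ordered by the hardware, which is the Torvalds side of the argument: ordinary acquire/release-style code needs no explicit barriers at all on x86.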

Peter Cordes
  • I'd also add that the MIPS instruction set has its own legacy baggage. The branch delay slots, for example, are tied to the classic 5-stage pipeline, and don't make sense in modern designs. – Ross Ridge Sep 30 '16 at 18:38
  • @RossRidge: Right, good point. Exposing microarchitectural details in the ISA always becomes a burden when things change; e.g. VLIW architectures can't very well get wider without a recompile. I never really learned IA-64, but apparently it exposes a lot. When this argument comes up, it's usually about ARM and how it will eventually take over the world once design and manufacturing of ARM CPUs improves and Intel's process advantage can't pay for the x86 tax anymore, with the other side arguing that the x86 tax is a flat cost. I substituted MIPS based on this question. – Peter Cordes Sep 30 '16 at 19:01

The MIPS instruction set is a RISC instruction set that was designed to be easy to pipeline, whereas the x86 instruction set just grew like a malignant tumor. Pipelined x86 implementations typically translate each x86 instruction into one or more RISC-like micro-operations, which are then pipelined.

markgz