See Agner Fog's microarch guide for pipeline details like this, and David Kanter's uarch deep-dive on Haswell with block diagrams: https://www.realworldtech.com/haswell-cpu/ (which also links to some of his articles on other uarches, like SnB, Core 2, AMD Bulldozer, and K8). See also the other links in https://stackoverflow.com/tags/x86/info
Yes, modern x86 cores are superscalar out-of-order execution. The fundamentals haven't changed since PPro: decode x86 machine code into micro-ops (uops) that can be scheduled by a ROB + RS.
(Terminology: Intel uses "issue" to mean "copy into the out-of-order back-end", allocating resources (ROB and RS entries) and updating the RAT, and "dispatch" to mean "send from the scheduler to an execution unit". Much of the rest of the computer-architecture field uses the opposite terminology.)
Intel since Core 2 has been 4 uops wide superscalar in the issue/rename/allocate stage, the narrowest point in the pipeline. (Before that, PPro through Pentium M was 3-wide.) Core 2 could rarely sustain 4/clock in practice; there were too many other bottlenecks. Skylake can often come very close in high-throughput code.
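As a back-of-envelope illustration (a toy model, not a real performance predictor: real loops hit port, latency, and memory bottlenecks too), the issue width alone puts a floor on cycles per loop iteration:

```python
# Toy model of the issue-width bottleneck: lower bound on cycles per
# loop iteration if the ONLY limit were the issue/rename stage.
def min_cycles_per_iter(fused_domain_uops, issue_width=4):
    """Lower bound on cycles/iteration from issue width alone."""
    return fused_domain_uops / issue_width

# A hypothetical 9-uop loop body can't run faster than 2.25 cycles per
# iteration on a 4-wide core, no matter how many execution ports it has.
print(min_cycles_per_iter(9))                 # 2.25
print(min_cycles_per_iter(9, issue_width=5))  # 1.8 on a 5-wide core
```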
To get more work into each fused-domain uop, there's micro-fusion of an ALU uop with its memory-source load, and macro-fusion of e.g. cmp/test + jcc so a compare-and-branch pair decodes as a single uop. (See Agner Fog's microarch guide.) The max sustained unfused-domain throughput is 7 uops per clock, achievable in practice on Skylake. In a burst, the scheduler can dispatch uops to every port.
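To see how 4 fused-domain uops/clock can be 7 unfused-domain uops/clock, here's a toy accounting of a hypothetical loop body (instruction mnemonics and uop counts chosen for illustration; check Agner Fog's tables for real counts):

```python
# Fused- vs unfused-domain uop accounting for a hypothetical loop body.
# Each entry: (instruction, fused-domain uops, unfused-domain uops)
loop_body = [
    ("add eax, [rdi]", 1, 2),  # micro-fused ALU + load
    ("add ebx, [rsi]", 1, 2),  # micro-fused ALU + load
    ("mov [rdx], ecx", 1, 2),  # store: store-address + store-data micro-fused
    ("cmp/jcc",        1, 1),  # macro-fused compare-and-branch: 1 uop in both domains
]

fused   = sum(f for _, f, _ in loop_body)
unfused = sum(u for _, _, u in loop_body)
print(fused, unfused)  # 4 7: issues in one clock, but keeps 7 execution-side uops busy
```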
Ice Lake (Sunny Cove uarch) widens the issue stage to 5.
AMD Zen's front-end is 6 uops wide, but only 5 instructions wide, so it can only achieve 6 uops/clock when running at least some 2-uop instructions, e.g. 256-bit AVX SIMD instructions, which it decodes into 2x 128-bit halves (or worse for lane-crossing shuffles).
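A minimal sketch of that dual limit, assuming a simplified model (ignoring alignment and decode-group effects) where each cycle is bounded by whichever of the two limits binds first:

```python
# Sketch of Zen 1's front-end limits: at most 5 instructions AND
# at most 6 uops per clock. 256-bit AVX ops decode to 2 uops each.
def zen1_uops_per_clock(instructions, insn_limit=5, uop_limit=6):
    """instructions: per-instruction uop counts in program order.
    Returns uops issued in one clock under both limits (toy model)."""
    insns = uops = 0
    for u in instructions:
        if insns + 1 > insn_limit or uops + u > uop_limit:
            break
        insns += 1
        uops += u
    return uops

print(zen1_uops_per_clock([1, 1, 1, 1, 1]))  # 5: all 1-uop insns, the insn limit binds
print(zen1_uops_per_clock([2, 1, 1, 1, 1]))  # 6: one 2-uop AVX insn reaches the uop limit
```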
Skylake widened the legacy decoders to 5 uops/clock and uop-cache fetch to 6 uops/clock, up from 4/clock in SnB through Broadwell. This hides front-end bubbles and keeps the issue/rename stage fed with 4 uops per clock more of the time in high-throughput code. (There are buffers/queues between stages, e.g. the 64-uop IDQ that feeds the issue/rename stage.)
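A toy simulation of why that helps (hypothetical decode pattern and simplified single-queue model, just to show the buffering effect): a queue between the decoders and the issue stage absorbs decode bubbles as long as the decoders have banked up enough surplus.

```python
# Toy model: decoders deliver a variable number of uops per cycle into a
# queue (like the IDQ); the issue stage drains up to 4/clock from it.
def run(decode_pattern, queue_cap=64, issue_width=4):
    """Return uops issued each cycle for a per-cycle decode pattern."""
    queue = 0
    issued = []
    for decoded in decode_pattern:
        queue = min(queue + decoded, queue_cap)
        n = min(queue, issue_width)  # issue stage drains the queue
        queue -= n
        issued.append(n)
    return issued

# 5/clock decode banks up a surplus, so a one-cycle decode bubble (0 uops)
# is fully hidden: issue still sustains 4/clock every cycle.
print(run([5, 5, 5, 5, 0, 5, 5]))  # [4, 4, 4, 4, 4, 4, 4]
```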
This includes your Kaby or Coffee Lake CPU: microarchitecturally the IA cores in KBL are identical to SKL, and Coffee Lake is a very minor tweak (fixing the loop buffer which SKL had to disable in a microcode update because of a partial-register merging uop erratum, aka CPU bug). KBL and CFL have better GPUs than SKL but the x86 cores are basically the same.
Yes, there are diminishing returns beyond 3- or 4-wide for most code, but SMT lets a wide core find the ILP in two (or 4 or 8) threads of execution at once. That keeps wide cores from being wasted, but the cost of a core scales more than linearly with width, so you only do it if a single thread can sometimes use most of that width. Otherwise you'd just build more smaller cores. (At least if you have a scalable interconnect for more cores...) My answer on Why not make one big CPU core? on electronics.SE has more details about the tradeoffs and the limited ILP available in real workloads.