Is there an estimate of the maximum Instructions Per Cycle achievable by the Intel Nehalem architecture? Also, what is the bottleneck that limits the maximum Instructions Per Cycle?
1 Answer
TL:DR:
Intel Core, Nehalem, and Sandybridge / IvyBridge: a maximum of 5 IPC, including 1 macro-fused cmp+branch to get 5 instructions into 4 fused-domain uops, with the rest being single-uop instructions. (Up to 2 of these can be micro-fused stores or load+ALU.)
Haswell through 9th-gen: a maximum of 6 instructions per cycle, using two pairs of macro-fusable ALU+branch instructions and two instructions that each decode to a single (potentially micro-fused) uop. The max unfused-domain uop throughput is 7 uops per clock, according to my testing on Skylake.
Early P6-family: Pentium Pro/PII/PIII, and Pentium M. Also Pentium 4: a maximum of 3 instructions per cycle can be achieved using 3 instructions that are decoded into 3 uops. (No macro-fusion, and 3-wide decode and issue).
The max IPC on Sunny Cove may be 7, thanks to increased front-end bandwidth of 5 uops per clock.
Source: Agner Fog's microarch pdf and instruction tables. Also see the x86 tag wiki.
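For example, a loop along these lines (an untested sketch; the specific registers and constants are arbitrary) packs 6 instructions into 4 fused-domain uops per iteration, the pattern needed to hit 6 IPC on Haswell:

```
; hypothetical 6-IPC loop for Haswell: 6 instructions -> 4 fused-domain uops
.l:
    add  eax, 1       ; single-uop ALU
    add  ebx, 2       ; single-uop ALU
    test ecx, ecx     ; macro-fuses with jz. ecx is kept nonzero,
    jz   .done        ;   so this branch is never taken (port 0 or 6)
    dec  edx          ; macro-fuses with jnz
    jnz  .l           ; taken loop branch
.done:
```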
The out-of-order pipeline in Intel Core2 and later can issue/rename 4 fused-domain uops per clock. This is the bottleneck. Macro-fusion will combine a `cmp`/`jcc` pair into a single uop, but this can only happen once per decode block. (Until Haswell.)
Decode is another important bottleneck, before the uop cache in SnB-family. (Up to 4 instructions into up to 7 uops with a 4-1-1-1 pattern in Core2 and Nehalem; SnB-family is up to 4 uops total, or up to 5 in Skylake, e.g. a 2-1-1-1 pattern from still only 4 decoders, not 5 as some sources incorrectly report.) Multi-uop instructions have to decode in the first "slot". See Agner Fog's microarch guide for much more about the potential bottlenecks in Nehalem.
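To illustrate the decode-grouping constraint (my own sketch, not an example from Agner's guide): on Core2/Nehalem a 4-1-1-1 group decodes in one cycle only if the multi-uop instruction lands in the first (complex) decoder:

```
; hypothetical 4-1-1-1 decode group on Core2 / Nehalem
add [rdi], eax     ; multi-uop (load + add + store): must hit the
                   ;   first, complex decoder
mov ecx, [rsi]     ; single uop, simple decoder
add ebx, ecx       ; single uop
inc rdx            ; single uop
; if the multi-uop instruction instead falls in a later slot, decode
; stalls until it can start a new group in the complex decoder
```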
Nehalem InstLatx64 shows that `nop` surprisingly only has 0.33c throughput, not 0.25, but it turns out according to https://www.uops.info/table.html that's because `nop` needs an ALU execution unit in CPUs before Sandybridge. Agner Fog says he didn't detect a retirement bottleneck on Nehalem.
Even if you could arrange things so more than one macro-fused pair per 4 uops was in a loop, Nehalem has a throughput of only one fused test-and-branch uop per clock (port 5). So it couldn't sustain more than one macro-fused compare-and-branch per clock even if some of them are not-taken. (Haswell can run not-taken branches on port 0 or port 6, so 6 IPC throughput can be sustained as long as at least one of the macro-fused branches is not-taken.)
;; Should run at one iteration per clock
.l:
mov edx, [rsi] ; doesn't need an ALU uop. A store would work here, too, but a NOP needs an ALU port on Nehalem.
add eax, edx
inc rsi
cmp rsi, rdi ; macro-fuses
jb .l ; with this, into 1 cmp+branch uop
For ease of testing, and to remove cache/memory bottlenecks, you could change it to load from the same location every time, instead of using the loop counter in the addressing mode. (As long as you avoid register-read stalls from too many cold registers.)
Note that pre-Haswell uarches only have three ALU ports. But `mov` loads or stores take pipeline bandwidth, so there's a benefit to having 4-wide issue/rename. It's also useful for the front-end to be able to issue faster than the out-of-order core can execute, so there is always a buffer of work queued up in the scheduler, letting it find instruction-level parallelism, get started on future loads early, and so on.
I think other than load/store (including `push`/`pop` thanks to the stack engine), `fxch` might be the only fused-domain uop that doesn't need an ALU port in Nehalem. Or maybe it actually does, like `nop`. On SnB-family uarches, `xor same,same` is handled in the rename/issue stage, and sometimes also reg-reg `mov`s (IvB and later). `nop` is also never executed, unlike on Nehalem, so SnB/IvB have 0.25c throughput for `nop` even though they only have 3 ALU ports.
An eliminated `mov reg,reg` on Ivy Bridge can also be part of a loop that runs 4 front-end uops per clock with only 3 back-end ALU ports.
For maxing out back-end uop throughput, you need micro-fusion to get 2 back-end uops (load + ALU) through the front-end as a single fused-domain uop in decode, issue/rename, and in the ROB. https://www.agner.org/optimize/blog/read.php?i=415#852
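For instance, a loop like this (an untested sketch) issues 4 fused-domain uops per iteration that expand to 7 unfused-domain uops in the back end:

```
; sketch: 4 fused-domain uops/clock carrying 7 unfused-domain uops
.l:
    add eax, [rsi]   ; micro-fused load + ALU           (2 unfused uops)
    add ebx, [rsi]   ; micro-fused load + ALU           (2 unfused uops)
    mov [rdi], eax   ; micro-fused store-addr + store-data (2 unfused uops)
    dec edx
    jnz .l           ; macro-fused dec + branch         (1 uop)
```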

-
@Hadi: Thanks for the edit, but it introduced an error unless I'm mistaken. Haswell (not Skylake) added the extra not-taken-branch EU and support for 2 macro-fusions from one decode group, like I said later in this answer. Also, the ROB is *always* in the fused domain. The RS is fused on P6-family but not SnB-family; maybe that's what you were thinking of. – Peter Cordes Aug 07 '19 at 13:25
-
Haswell does have two branch execution units, but are you sure it supported two macro-fusions per cycle? I just ran a quick test with a loop that contains a sequence of two `add` instructions and two macro-fusable `cmp/jcc` instructions (one is always not taken and one is always taken for loop control). `UOPS_ISSUED_ANY` shows that there are 4 uops per iteration (confirms that macro-fusion works) but the throughput is 1.7 cycles per iteration. I think this indicates that Haswell can do only one macro-fusion per cycle, unless I'm missing something. – Hadi Brais Aug 07 '19 at 21:58
-
Also I remember that the ROB in SnB and/or NHM is not the fused domain, but I don't currently have access to an SnB system to test it, so I'm not really sure. I think there is a chance that the ROB was not always in the fused domain. – Hadi Brais Aug 07 '19 at 22:00
-
@HadiBrais: I don't have a HSW available for testing :/ A tight loop doesn't get re-decoded every iteration unless you overflowed the uop cache with NOPs or something. So 4 uops confirms that your loop decoded with both branches fused. If you precede the loop with a multi-uop insn, that makes sure the loop hits the decoders as a group. It doesn't have to re-fuse them every iteration, just issue from the LSD and execute the uops. Aligning the loop entry point by 64 (with the last instruction of padding being multi-uop) might help rule out front-end weirdness. – Peter Cordes Aug 07 '19 at 22:05
-
@HadiBrais: oh yes, I'd forgotten about SnB/IvB probably having an unfused ROB, like you said in [Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](//stackoverflow.com/posts/comments/93637873). Nehalem though even has a fused-domain RS so I'd assume the ROB is the same. – Peter Cordes Aug 07 '19 at 22:07
-
@HadiBrais: But even when the ROB is unfused-domain, I think the front-end throughput is still 4 fused-domain uops per clock. So in any given cycle, the issue stage is capable of writing 8 entries each into the ROB and RS, if all the uops are micro-fused load+ALU or stores. (Sustainable as part of a larger loop, or with 2 micro-fused uops in a 4-uop loop.) I haven't tested this conclusion to see if my theory matches reality, but we do know that SKL's unfused RS is not a bottleneck for my 7 uop/clock loop. It's not like unlamination. – Peter Cordes Aug 07 '19 at 22:29
-
@HadiBrais: Given that we know the LSD in SnB/IvB doesn't "unroll", we could make an 8 uop loop with the first 4 micro-fused and the last 4 uops not (probably 3x `nop` and a `dec/jnz`), and we'd know that to run at 2c / iter it would have to issue a group of 4 micro-fused uops. – Peter Cordes Aug 07 '19 at 22:32
-
Oh yes, I forgot about alignment. After aligning the loop on a 64-byte boundary, I get an IPC of 6. Thank you for pointing that out. And yes, it doesn't have to support two macro-fusions per cycle as long as it can feed 4 uops per cycle from the uop cache or the LSD. Agner's guide does say in Section 10.6 that HSW and BDW support two macro-fusions in a single cycle. – Hadi Brais Aug 07 '19 at 22:44
-
@HadiBrais: usually the LSD makes alignment not matter as long as all the uops can fit in the uop cache. Were you using a lot of small NOPs or something that would have needed more than 3 uop cache lines for a 32-byte block? – Peter Cordes Aug 07 '19 at 22:47
-
I wasn't using any NOPs; just two `add`s and two `cmp/jmp`s. However, one of the two `add`s includes a load uop for the source operand, which seems to be another factor that causes the IPC to be below 6. Overall, the misalignment was wasting about 5% of total slots and the load uop was wasting about 20% of total slots. I had to fix both to achieve an IPC of 6. – Hadi Brais Aug 07 '19 at 23:03
-
@HadiBrais: In https://www.agner.org/optimize/blog/read.php?i=415#857 I found some possible bottlenecks on register-reads in HSW and SKL. But I still don't see how misalignment wastes slots, unless it prevents the LSD from "locking down" the group of 4 fused-domain uops to issue repeatedly. They don't have to come from the same uop cache line or the same 64-byte L1i line. Maybe check `lsd.uops` or related front-end counters. Or not, if you're not that curious about it. Of course spanning a 64-byte boundary in decode can block macro-fusion, but you said you saw 4 fused-domain uops. – Peter Cordes Aug 07 '19 at 23:09