
I performed the following experiments both on a Haswell and a Coffee Lake machine.

The instruction

cmp rbx, qword ptr [r14+rax]

has a throughput of 0.5 (i.e., 2 instructions per cycle). This is as expected. The instruction is decoded to one µop that is later unlaminated (see https://stackoverflow.com/a/31027695/10461973) and, thus, requires two retire slots.

If we add a nop instruction

cmp rbx, qword ptr [r14+rax]; nop

I would expect a throughput of 0.75, as this sequence requires 3 retire slots, and there also seem to be no other bottlenecks in the back-end. This is also the throughput that IACA reports. However, the actual throughput is 1 (this is independent of whether the µops come from the decoders or the DSB). What is the bottleneck in this case?

Without the indexed addressing mode,

cmp rbx, qword ptr [r14]; nop

has a throughput of 0.5, as expected.
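For reference, a minimal NASM-style sketch of the kind of test loop behind these numbers is shown below; the buffer, unroll factor, and iteration count are illustrative and not the exact harness used for the measurements. One would time it with a cycle counter (e.g. perf) and divide by the number of executed copies of the sequence.

    section .bss
    buf:    resq 1                      ; readable memory for the load

    section .text
    global _start
    _start:
        lea  r14, [rel buf]
        xor  eax, eax                   ; rax = 0, so [r14+rax] always points into buf
        mov  ecx, 10000000              ; iteration count (illustrative)
    .loop:
    %rep 100                            ; unroll factor (illustrative)
        cmp  rbx, qword [r14+rax]
        nop
    %endrep
        dec  ecx
        jnz  .loop
        mov  eax, 60                    ; exit(0) via Linux syscall
        xor  edi, edi
        syscall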

Andreas Abel

1 Answer


It seems you've uncovered a downside to unlamination vs. regular multi-uop instructions, perhaps in the interaction with 4-wide issue/rename/allocate when a micro-fused uop reaches the head of the IDQ.

Hypothesis: maybe both uops resulting from un-lamination have to be part of the same issue group, so `cmp rbx, [r14+rax]; nop` repeated only achieves a front-end throughput of 3 fused-domain uops per clock.
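To spell the hypothesis out, here is how the fused-domain uops of the repeated sequence would pack into 4-wide issue groups if the pair produced by un-lamination can't be split across groups (a sketch of the hypothesis, not observed pipeline state):

    ; hypothetical packing into 4-wide issue groups, assuming the two uops
    ; from un-lamination must issue in the same group:
    ;   cycle 0:  [ cmp.load | cmp.cmp | nop | (empty) ]   ; next cmp's pair doesn't fit in one slot
    ;   cycle 1:  [ cmp.load | cmp.cmp | nop | (empty) ]
    ;   ...
    ; -> 3 fused-domain uops per cycle, i.e. 1 cycle per cmp+nop pair
    ; without that restriction the groups could be fully packed: 4 uops per cycle -> 0.75 c per pair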

That might make sense if un-lamination only happens at the head of the IDQ, as uops reach the alloc/rename stage, rather than as they're added to the IDQ. To test this, we could check whether LSD (loop buffer) capacity on Haswell depends on the uop count before or after unlamination. @AndreasAbel's testing shows that a loop containing 55x `cmp rbx, [r14+rax]` runs from the LSD on Haswell (whose LSD holds at most 56 uops, so 110 unlaminated uops wouldn't fit), so that's strong evidence that unlamination happens during alloc/rename, not taking multiple entries in the IDQ itself.


For comparison, `cmp dword [rip+rel32], 1` won't micro-fuse in the first place, in the decoders, so it never gets un-laminated. If that instruction followed by a `nop` achieves 0.75c throughput, that would be evidence in support of un-lamination requiring room in the same issue group.
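Using the same kind of unrolled loop as in the question, the test I have in mind would look something like this (NASM syntax; `val` is an illustrative dword in the data section, and `[rel val]` assembles to the RIP-relative form):

    section .data
    val:    dd 1

    ; loop body, repeated inside the timing loop:
    %rep 100
        cmp  dword [rel val], 1     ; RIP-relative + immediate: two separate uops already in the decoders
        nop
    %endrep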

Perhaps `times 2 nop` followed by the un-laminating `cmp` (or `times 3 nop`) could also be an interesting test, to see if the unlaminated pair ever issues by itself or can reliably grab 2 more NOPs after it's been delayed from whatever position in an issue group. From your back-to-back cmp-unlaminate test, I expect we'd still see mostly full 4-uop issue groups.
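Concretely, the loop bodies for those two tests would be something like the following (NASM's `times n nop` just emits n one-byte NOPs; the surrounding timing loop is the same as in your other tests):

        times 2 nop                     ; 2 NOPs plus the un-laminating cmp: 4 fused-domain uops per iteration
        cmp   rbx, qword [r14+rax]

    ; and, as the second variant:

        times 3 nop                     ; 3 NOPs plus the cmp: 5 fused-domain uops per iteration
        cmp   rbx, qword [r14+rax]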


Your question mentions retirement but not issue.

Retire is at least as wide as issue (4-wide from Core2 to Skylake, 5-wide in Ice Lake).

Sandy Bridge / Haswell retire 4 fused-domain uops per clock. Skylake can retire 4 fused-domain uops per clock per hyperthread, allowing quicker release of resources like load-buffer entries after one old stalled uop finally completes, if both logical cores are busy. It's not 100% clear whether it can retire 8 per clock when running in single-thread mode; I found conflicting claims and no clear statement in Intel's optimization manual.

It's very hard, if not impossible, to actually create a bottleneck on retirement (but not issue). Any sustained stream of uops has to get through the issue stage, which is not wider than retirement. (Performance counters for uops_issued.any indicate that un-lamination happens at some point before issue, so that doesn't help us jam more uops through the front-end than retirement can handle. Unless that's misleading: running the same loop on both logical cores of the same physical core should hit the same overall bottleneck, but if Skylake runs it faster in that configuration, that would tell us that parallel SMT retirement helped. Unlikely, but something to check if anyone wants to rule it out.)


This is also the throughput that IACA reports

IACA's pipeline model seems pretty naive; I don't think it knows about Sandy Bridge's multiple-of-4-uop issue effect (e.g. a 6-uop loop costs the same as an 8-uop one). IACA also doesn't know that Haswell can keep `add eax, [rdi+rdx]` micro-fused throughout the pipeline, so any analysis of indexed-addressing uops that don't un-laminate is wrong.

I wouldn't trust IACA to do more than count uops and make some wild guesses about how they will allocate to ports.
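For anyone who wants to re-check the IACA numbers against measurement: IACA analyzes the code between its byte markers, which in NASM syntax look roughly like this (marker values as defined in Intel's iacaMarks.h; the analyzed body here is just the question's sequence):

        mov ebx, 111                    ; IACA start marker
        db  0x64, 0x67, 0x90
        cmp rbx, qword [r14+rax]
        nop
        mov ebx, 222                    ; IACA end marker
        db  0x64, 0x67, 0x90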

Peter Cordes
  • According to https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Retirement retire is 4 µops per cycle and "allocation queue delivery" is 6 µops per cycle on Skylake. – Andreas Abel Aug 06 '20 at 16:16
  • `cmp dword ptr [rip+0x8], 1; nop` achieves 0.75c (if the µops come from the DSB). – Andreas Abel Aug 06 '20 at 16:21
  • `nop;nop;cmp rbx,QWORD PTR [r14+rax];` achieves 1c, `nop;nop;nop;cmp rbx,QWORD PTR [r14+rax];` achieves 1.33c. – Andreas Abel Aug 06 '20 at 16:25
  • @AndreasAbel: SKL's uop cache can deliver up to 6 uops per clock to the far end of the IDQ. The issue stage is definitely only 4 wide, competitively shared between SMT threads. (Unlike retirement, where each SMT thread can retire independently. But for this sustained-throughput case, retirement width isn't a bottleneck even if it's only 4 wide, unless it also has weird effects from unlamination.) That paragraph on wikichip seems misleading. Intel's optimization manual confirms Skylake is 4-wide, e.g. it says that Ice Lake widens allocation to 5, up from 4 since Core 2. – Peter Cordes Aug 06 '20 at 16:37
  • The block diagram on wikichip https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Individual_Core is also misleading, as it has 6 arrows from the IDQ to the reorder buffer. But anyway, the point of mentioning the retire slots in my question was to make it clear that the retire width is not the bottleneck for the `cmp rbx, qword ptr [r14+rax]; nop` case. – Andreas Abel Aug 06 '20 at 16:46
  • @AndreasAbel: re: retirement, yes it seems you were right, I didn't find any clear + reliable claim that Skylake can retire more than 4 uops / clock from a single thread. The extra retirement bandwidth vs. Haswell is only in the form of 4 uops/clock from both hyperthreads at the same time. Unless it can do 8/clock on a single thread? I haven't ruled that out. Anyway, updated my answer to move that section to the bottom. – Peter Cordes Aug 06 '20 at 17:25
  • On Haswell, a loop with 55 `cmp rbx, [r14+rax]` instructions runs from the LSD. So it appears that unlamination happens when/after the uops leave the IDQ. – Andreas Abel Apr 07 '21 at 23:33