
Consider this simple C++ function to calculate the prefix sum of an array:

#include <cstdint>
#include <cstddef>

void prefix_sum(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    for (size_t i = 0; i < size; i++) {
        total += input[i];
        output[i] = total;
    }
}

The loop compiles to the following assembly on gcc 5.5:

.L5:
        add     ecx, DWORD PTR [rdi+rax*4]
        mov     DWORD PTR [rsi+rax*4], ecx
        add     rax, 1
        cmp     rdx, rax
        jne     .L5

I don't see anything that would prevent this from running at 1 cycle per iteration, yet I consistently measure it at 1.32 (+/- 0.01) cycles/iteration on my Skylake i7-6700HQ, when running it against 8 KiB input/output arrays.
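
(editor's note: for anyone who wants to reproduce the measurement, a minimal, hypothetical harness might look like the sketch below. It is not the author's benchmark, which uses hardware performance counters; __rdtsc counts reference cycles, so on a machine whose core clock differs from the TSC clock the result needs rescaling.)

#include <cstdint>
#include <cstddef>
#include <cstdio>
#include <vector>
#include <x86intrin.h>   // __rdtsc

void prefix_sum(const uint32_t* input, uint32_t* output, size_t size);

int main() {
    const size_t n = 2048;                      // 8 KiB of uint32_t per array
    std::vector<uint32_t> in(n, 1), out(n);
    const int reps = 100000;

    prefix_sum(in.data(), out.data(), n);       // warm caches, fault in pages

    uint64_t t0 = __rdtsc();
    for (int r = 0; r < reps; r++)
        prefix_sum(in.data(), out.data(), n);
    uint64_t tsc = __rdtsc() - t0;

    printf("%.3f reference cycles per element\n",
           (double)tsc / ((double)reps * n));
    return 0;
}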

The loop is served out of the uop cache, doesn't cross any uop-cache boundary, and the performance counters don't indicate any front-end bottleneck.

It's 4 fused uops[1], and this CPU can sustain 4 fused uops per cycle.

There are loop-carried dependency chains through ecx and rax, each of 1 cycle, but these add uops can go to any of the 4 ALU ports, so they seem unlikely to conflict. The fused cmp/jne needs to go to p6, which is more of a concern, but I measure only 1.1 uops/iteration to p6. That would explain 1.1 cycles per iteration, but not the 1.32 I measure. If I unroll the loop by 2x, port pressure is much lower: less than 0.7 uops to all of p0156, yet performance is still unexpectedly slow at 1.3 cycles per iteration.
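
(editor's note: a hypothetical source-level version of that 2x unroll, only to make the per-iteration counts concrete; the unrolled test was presumably done on the asm, and prefix_sum_unroll2 is not a name from the question. Per two elements there is still one loop branch, but two load+add and two store uops.)

#include <cstdint>
#include <cstddef>

void prefix_sum_unroll2(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    size_t i = 0;
    for (; i + 1 < size; i += 2) {      // two elements per loop branch
        total += input[i];
        output[i] = total;
        total += input[i + 1];
        output[i + 1] = total;
    }
    for (; i < size; i++) {             // tail for odd sizes
        total += input[i];
        output[i] = total;
    }
}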

There is one store per iteration, but we can do one store per cycle.

There is one load per iteration, but we can do two of those per cycle.

There are two complex address generations (indexed addressing modes) per iteration, but we can do two of those per cycle.

What's the bottleneck here?

Interestingly I tried the Ithemal performance predictor and it gets it almost exactly right: estimating 1.314 cycles versus my measurement of 1.32.


[1] I confirmed macro- and micro-fusion via the uops_issued.any counter, which counts in the fused domain and reads 4.0 fused uops per iteration for this loop.

BeeOnRope
  • Did you check for 4k aliasing? I'll test-run it on my desktop if you have a handy MCVE caller for it. – Peter Cordes Aug 12 '19 at 20:52
  • @PeterCordes I checked that `ld_blocks_partial.address_alias` reports a low figure and doesn't increase with problem size. Both arrays are aligned to 2 MiB. Yeah, I should provide an MCVE, but it's a bit of work since the current benchmark is spread across a dozen files, but I'll get it up at some point. – BeeOnRope Aug 12 '19 at 20:57
  • I tried various relative offsets between the two arrays and didn't get any obvious performance improvement. – BeeOnRope Aug 12 '19 at 21:07
  • I put the gcc asm into a NASM `.asm` file with a caller that passes BSS arrays. https://godbolt.org/z/D8HDOh With a large enough repeat count, the initial page-fault cost is amortized to negligible. Since `add` performance isn't data-dependent there's no need to init them. Anyway, I get an average of 3.81 instructions per cycle, not 5.0, so I see the same thing on i7-6700k. After supper I'll try indexing from the ends of the arrays and counting up towards zero, removing 1 uop of loop overhead. – Peter Cordes Aug 12 '19 at 21:36
  • @PeterCordes 4k aliasing cannot occur here because the aliasing load is before the store in program order, not after. @Bee, check whether `CYCLE_ACTIVITY.STALLS_MEM_ANY` is high. If it is, it probably means that loads are not completing at a rate of 1 per cycle. In this case, the load buffer will become full, resulting in resource stalls. – Hadi Brais Aug 12 '19 at 21:46
  • @HadiBrais - note that the input and output arrays are distinct, so it is possible for aliasing to occur; it depends on the relative offset between `input` and `output`. At a relative offset of 0 (mod 4k) no aliasing occurs because the load occurs immediately before the store to the same (mod 4k) location. If output is 4 bytes ahead of input (mod 4k), a ton of aliasing occurs. My test has a relative alignment of 0 (mod 4k), however, so no aliasing occurs. – BeeOnRope Aug 12 '19 at 21:54
  • @HadiBrais: I get 2.5 million counts for `CYCLE_ACTIVITY.STALLS_MEM_ANY:u` out of 2.7 billion cycles. So it's not high but non-zero. (Without restricting to user-space only, it's about 4.2M). But `resource_stalls.sb:u` is about 70k to 90k and noisy, lower by a factor of ~30. So store bottlenecks are probably just noise. – Peter Cordes Aug 12 '19 at 22:15
  • 2.5 _million_ out of 2.7 _billion_, i.e., less than 0.1%. I see similar numbers. – BeeOnRope Aug 12 '19 at 22:26
  • Indexing from the end of the arrays (but still with the same complex addressing modes) https://godbolt.org/z/plHeVn speeds us up by ~1.2x, running at ~1.08 cycles per branch. (Including the call/ret because I'm lazy) – Peter Cordes Aug 12 '19 at 22:33
  • I wonder if there's some kind of register-read limit. e.g. https://www.agner.org/optimize/blog/read.php?i=415#857 also demonstrates that reading more registers (or using complex addressing modes?) slows down Skylake. So the speedup from my change might have been from eliminating one register from the loop condition. – Peter Cordes Aug 12 '19 at 22:36
  • I noticed that p4 counts are higher than 1 per iteration and close to the cycles/iteration, i.e., can explain most of the performance difference. For example an unrolled version of the original runs at 1.26 cycles/iteration and shows 1.25 uops/iteration to p4. Indicates that perhaps the stores are being replayed because their operand is not ready? More likely that it is a symptom than the cause though. – BeeOnRope Aug 12 '19 at 22:40
  • Try storing from a cold register, or the loop counter, so the store data doesn't depend on a load result. – Peter Cordes Aug 13 '19 at 20:41
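
(editor's note: a hypothetical C++ sketch of the index-toward-zero rewrite Peter Cordes describes in the comments above; prefix_sum_endidx is not a name from the thread. Counting a negative index up toward zero lets the loop branch test the index register directly, so no separate cmp against the size register is needed.)

#include <cstdint>
#include <cstddef>

void prefix_sum_endidx(const uint32_t* input, uint32_t* output, size_t size) {
    const uint32_t* in_end = input + size;
    uint32_t* out_end = output + size;
    uint32_t total = 0;
    // i runs from -size up to 0; an asm loop can use add/jnz on the index
    // register with no separate cmp, saving one uop of loop overhead.
    for (ptrdiff_t i = -(ptrdiff_t)size; i != 0; i++) {
        total += in_end[i];
        out_end[i] = total;
    }
}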

1 Answer


I just played around with the instructions in the Ithemal performance predictor and I might have found the issue. Trying out

add     ecx, DWORD PTR [rdi]
mov     DWORD PTR [rsi], ecx
add     rax, 1
cmp     rdx, rax

gives a stunning 1.131 cycles per iteration. Cross-checking by adding 0 in each iteration (which again gives 1.3 cycles) eliminates the possibility of a store/load bottleneck, which finally suggests an issue with the addressing modes.

(editor's note: this is interesting experimental data, matching what I posted in the thread on Agner Fog's blog which the guess below misinterprets. Simpler addressing modes speed it up even though there's no unlamination.)
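
(editor's note: a hypothetical source-level analogue of that experiment, in which bumping the pointers gives the load and the store simple [reg] addressing modes at the cost of extra pointer-increment uops; prefix_sum_ptr is not from the answer, and a compiler may well rewrite this back into indexed form, so the asm-level test is more direct.)

#include <cstdint>
#include <cstddef>

void prefix_sum_ptr(const uint32_t* input, uint32_t* output, size_t size) {
    uint32_t total = 0;
    const uint32_t* in_end = input + size;
    while (input != in_end) {
        total += *input++;    // load uses a simple [reg] addressing mode
        *output++ = total;    // non-indexed store can use port 7's store AGU
    }
}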


(editor's note: this part is wrong: we know from the question there's no un-lamination because uops_issued.any = 4 per iteration.)

I think your CPU un-laminates your add/mov in the case of indexed addressing. This behavior is well documented for several architectures (SnB, HSW, SKL), and someone did a great job on Stack Overflow describing the whole thing: https://stackoverflow.com/a/31027695/1925289 In short: if too many registers & flags are involved, the fused uop from the DSB (uop cache) gets un-laminated at the IDQ, thus effectively un-fused again.


Peter Cordes
Zacharias
  • BeeOnRope said in the question that he confirmed the loop was 4 fused-domain uops using performance counters. So that rules out unlamination. That's also not what my post was about on Agner Fog's blog thread, it was about unfused-domain uop *throughput* limits, and/or register-read throughput limits. Not limits on how much fusion is possible. I found there on both HSW and SKL that reducing the number of input registers was helpful, indicating that there's some other unknown microarchitectural limit, exactly like you've demonstrated by reading fewer regs. – Peter Cordes Aug 20 '19 at 09:35
  • So yes, the complex addressing modes are a problem, but probably only because of the extra inputs for each uop. Possibly also because of the dependency on a recently-incremented RAX, but unlikely. Anyway, we know HSW and SKL can keep those add+load and mov-store uops micro-fused, and context outside of the instruction doesn't affect that. – Peter Cordes Aug 20 '19 at 09:37
  • un-lamination takes place after DSB. You sure uops_issued.any counts for that? – Zacharias Aug 20 '19 at 09:46
  • Yes, that's the counter I used to test stuff while writing the SO answer you linked to about un-lamination! Agner Fog measured that micro-fusion still happens in the decoders by looking at uop-cache packing effects, I think. Or decode bottlenecks from blocks of code far too big for the DSB. That's why he missed discovering unlamination in his uarch guide update. Anyway, instead of downvoting, I added notes to your answer separating the test data from the incorrect guess. Hopefully that's more useful to future readers. – Peter Cordes Aug 20 '19 at 09:48
  • `uops_issued.any` counts uops as they're copied (and register-renamed) from the IDQ into the ROB and RS. This is at the "back" end of the IDQ, far from the DSB (uop cache). Equivalently, when there are no branch misses or other mis-speculation, you can also count `uops_retired.retire_slots`, IIRC. I'm not sure whether un-lamination happens at the "front" of the IDQ, or at the back (in the issue stage). So IDK whether un-lamination limits the size of the loopback buffer when the LSD recycles uops from the IDQ. But it's definitely before / during issue. – Peter Cordes Aug 20 '19 at 09:51
  • @PeterCordes - I have my doubts that register-read restrictions (as you describe on Agner's blog) are involved here. First, it doesn't seem like there are enough registers read, and also the effect persists (but is smaller) if you unroll by 2x. With the 2x unroll, there are definitely not many registers read and the required IPC is something like 3 rather than 4, which also helps eliminate "too many uops" theories (like the unlamination one). In general, unrolling keeps reducing the delta versus the expected 1.0 cycles/iter; even at 4x unroll it is still at 1.07 cycles/iter (ish). – BeeOnRope Aug 20 '19 at 19:30
  • @BeeOnRope: Interesting. reg-read was definitely a wild guess. I didn't test that idea by making predictions or finding a consistent limit that differently-constructed loops would hit. Your test definitely casts doubt on that hypothesis. Maybe it's not reading but renaming that's limited? Like read-ports on the RAT. But if a group of 4 uops can't always rename in a single cycle, how could we get throughput better than 2c / iter? Have you looked at uops_dispatched to check for any uop replays causing a bottleneck on p2 or p3? An indexed store can't use port 7, requiring p23 saturation. – Peter Cordes Aug 21 '19 at 00:38
  • I want to note for future readers that the bounty was auto-assigned here, as the (only) answer with the most upvotes, but it does not answer the question. The bounty assignment is not an endorsement. – BeeOnRope Aug 24 '19 at 22:22