
When I run this little assembly program on my Ryzen 9 3900X:

_start:   xor       rax, rax
          xor       rcx, rcx
loop0:    add       rax, 1
          mov       rdx, rax
          and       rdx, 1
          add       rcx, rdx
          cmp       rcx, 1000000000
          jne       loop0

It completes in 450 ms if all the instructions from loop0 up to and including the jne are contained entirely in one cache line. That is, if:

floor((address of loop0) / 64) == floor((address of end of jne instruction) / 64)

However, if the above condition does not hold, the loop takes 900 ms instead.
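
A complete test program looks roughly like this (only a sketch: NASM syntax on Linux with a `_start` entry point and an exit syscall; the commented-out nop padding is just one illustrative way to move loop0 across a 64-byte boundary):

section .text align=64            ; start the section on a cache-line boundary
global _start

; times 50 nop                    ; uncomment to push the loop across a 64-byte boundary (slow case)

_start:   xor       rax, rax
          xor       rcx, rcx
loop0:    add       rax, 1
          mov       rdx, rax
          and       rdx, 1
          add       rcx, rdx
          cmp       rcx, 1000000000
          jne       loop0
          mov       rax, 60       ; sys_exit
          xor       rdi, rdi
          syscall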

I've made a repo with the code: https://github.com/avl/strange_performance_repro

Why is the inner loop much slower in some specific cases?

Edit: Removed a claim whose conclusion was based on a mistake in testing.

asked by avl_sweden, last edited by Peter Cordes

1 Answer


Your issue lies in the variable cost of the jne instruction.

First of all, to understand the impact of the effect, we need to analyze the full loop itself. The architecture of the Ryzen 9 3900X is Zen 2. Information about it can be found on the AMD website or on WikiChip. This architecture has 4 ALUs and 3 AGUs, which roughly means it can execute up to 4 instructions like add/and/cmp per cycle.

Here is the cost of each instruction of the loop (based on Agner Fog's instruction tables for Zen 1):

                                        # Reciprocal throughput
loop0:    add       rax, 1              # 0.25
          mov       rdx, rax            # 0.2
          and       rdx, 1              # 0.25
          add       rcx, rdx            # 0.25
          cmp       rcx, 1000000000     # 0.25  | Fused  0.5-2 (2 if jumping)
          jne       loop0               # 0.5-2 |

As you can see, the first 4 computing instructions of the loop can be executed in ~1 cycle. The last 2 instructions can be merged (fused) by your processor into a single faster one. Your main issue is that this jne instruction can be quite slow compared to the rest of the loop, so you are very likely measuring only the overhead of this instruction. From this point, things start to get complicated.

Engineers and researchers have worked hard over the last decades to reduce the cost of such instructions at (almost) any price. Nowadays, processors (like the Ryzen 9 3900X) use out-of-order instruction scheduling to execute the instructions the jne depends on as soon as possible. Most processors can also predict the address of the next instruction to execute after the jne and fetch new instructions (e.g. those of the next loop iteration) while the other instructions of the current iteration are still being executed. Performing the fetch as soon as possible is important to prevent any stall in the execution pipeline of the processor (especially in your case).

From the AMD document "Software Optimization Guide for AMD Family 17h Models 30h and Greater Processors", we can read:

  • 2.8.3 Loop Alignment:

    For the processor loop alignment is not usually a significant issue. However, for hot loops, some further knowledge of trade-offs can be helpful. Since the processor can read an aligned 64-byte fetch block every cycle, aligning the end of the loop to the last byte of a 64-byte cache line is the best thing to do, if possible.

  • 2.8.1.1 Next Address Logic

    The next-address logic determines addresses for instruction fetch. [...]. When branches are identified, the next-address logic is redirected by the branch target and branch direction prediction hardware to generate a non-sequential fetch block address. The processor facilities that are designed to predict the next instruction to be executed following a branch are detailed in the following sections.

Thus, a conditional branch to instructions located in another cache line introduces an additional latency overhead due to a fetch from the op cache (an instruction cache faster than the L1) that is not required if the whole loop fits in one cache line. Indeed, if the loop crosses a cache line, 2 cache-line fetches are required, which takes no less than 2 cycles. If the whole loop fits in a cache line, only 1 cache-line fetch is needed, which takes only 1 cycle. As a result, since your loop iterations are very fast, paying 1 additional cycle introduces a significant slowdown. But how much?

You say that the program takes about 450 ms. Since the Ryzen 9 3900X turbo frequency is 4.6 GHz and your loop is executed 2e9 times, the number of cycles per loop iteration is about 1.035. Note that this is better than what we could have expected from the table above (I guess this processor is able to rename registers, eliminate the mov instruction, and execute the jne in parallel in only 1 cycle while the other instructions of the loop are perfectly pipelined; which is astounding). This also shows that paying an additional fetch overhead of 1 cycle will double the number of cycles needed to execute each loop iteration, and so the overall execution time.
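
As a rough sanity check (assuming the core sustains its 4.6 GHz boost clock for the whole run):

          0.450 s * 4.6e9 cycles/s / 2e9 iterations ≈ 1.035 cycles/iteration
          2e9 iterations * 2 cycles / 4.6e9 cycles/s ≈ 0.87 s

so 2 cycles per iteration matches the measured ~900 ms quite well.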

If you do not want to pay this overhead, consider unrolling your loop to significantly reduce the number of conditional branches and non-sequential fetches.
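
For example, a 4x unrolling could look like this (just a sketch; %rep is a NASM directive, and checking the counter only once per unrolled body is still exact here because rcx grows by 2 per body and therefore reaches 1000000000 exactly at a check point):

loop0:
%rep 4                            ; repeat the original loop body 4 times
          add       rax, 1
          mov       rdx, rax
          and       rdx, 1
          add       rcx, rdx
%endrep
          cmp       rcx, 1000000000   ; only 1 fused cmp/jne per 4 iterations
          jne       loop0

The measurement below used an 8x unrolling, but the idea is the same.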

This problem can occur on other architectures, such as Intel Skylake. Indeed, the same loop on an i5-9600KF takes 0.70 s with loop alignment and 0.90 s without (also due to the additional 1-cycle fetch latency). With an 8x unrolling, the result is 0.53 s (whatever the alignment).

answered by Jérôme Richard
  • Oh man, thank you so much for this excellent answer! It makes perfect sense. I agree that the Ryzen 9 performance in this particular loop (when it is aligned) is extremely impressive. Doesn't each instruction have a data-dependency on the instruction before it? How on earth can it possibly run quicker than 1 cycle per instruction unless instructions are somehow merged? The reciprocal throughput of 0.25 normally only holds when data being operated on is already available, right? – avl_sweden Mar 08 '20 at 15:11
  • Today processors are insanely complex and "clever". They can execute multiple loop iterations speculatively. They can track dependencies between the instructions (of possibly different loop iterations) and *execute independent instructions in parallel*. Here, the processor succeeds in pipelining different loop iterations. For example, the instructions `jne loop0` of iter 1, `add rcx, rdx` of iter 2, `and rdx_2, 1` of iter 3 and `add rax, 1` of iter 4 can be executed in parallel (with `rdx` of iter 2 and 3 stored in different internal registers due to the renaming done by the processor itself). – Jérôme Richard Mar 08 '20 at 19:23
  • For the last question: all instructions have a latency of at least 1 cycle, and yes, instructions must wait for their data to be available. A reciprocal throughput of 0.25 means that 4 instructions of a given type can run simultaneously (working on different registers, possibly renamed beforehand). – Jérôme Richard Mar 08 '20 at 19:30
  • Yes, Zen has mov-elimination for GPRs, not just vector registers. The loop is 5 uops for the front-end, but only needs 4 back-end ALU execution units per iteration. This allows 1/clock throughput, assuming perfect scheduling so the 1-cycle dependency chains don't ever miss a cycle. (Even on CPUs where taken branches in general are worse than 1/clock throughput, loop branches are usually special somehow for tight loops.) – Peter Cordes Jul 31 '23 at 19:34
  • [On Skylake](https://stackoverflow.com/questions/45298870/why-does-loop-alignment-on-32-byte-make-code-faster), having a tiny loop split across two 32-byte blocks means its instructions will go in two separate uop-cache lines, making best-case front-end throughput 2 cycles / iter. (Before the microcode update that disabled the loop buffer (LSD), Skylake didn't have this problem; it could just lock down the loop in the IDQ (queue that feeds the issue stage) and replay its uops from there until a mispredict, even if the loop spanned a 64-byte boundary.) – Peter Cordes Jul 31 '23 at 19:37
  • I wonder if the actual reason in Zen 2 is more related to uop-cache lines than `jcc` throughput? Performance counters should show if it's a front-end or back-end bottleneck (ROB and/or RS full vs. empty) https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Front_End doesn't mention a loop buffer, so it might well be 1 vs. 2 uop-cache fetches. – Peter Cordes Jul 31 '23 at 19:38
  • @PeterCordes I do not have a Zen2 processor anymore to make tests (the post is 3 years old now). That being said, I still have my i5-9600KF processor and it still gives similar timings. My microcode version is `sig=0x906ec, pf=0x2, revision=0xae`. I do not expect to have a recent version (certainly the default one), but it is not clear to me how to get the latest version number for this processor. I am not convinced jcc can be the issue on Zen2 but it might be on Coffee Lake. Do you know which performance counter can be used to see JCC issues on Coffee Lake? – Jérôme Richard Aug 04 '23 at 19:21
  • @PeterCordes So far, only a few counters show a relevant value with a significant change between the slow and fast versions. `icache_64b.iftag_hit`: it is twice as big in the slow version (1.0/cycle for the slow version and 0.62/cycle for the fast one -- consistent with the fact that more cache lines need to be fetched because the loop spans 2 cache lines). The `idq_uops_not_delivered.xxx` counters show that the front-end clearly does not deliver enough uops half the time in the slow version (front-end bound) while it is nearly always fine with the fast version (mainly back-end bound). – Jérôme Richard Aug 04 '23 at 19:52
  • Coffee Lake behaves the same as Skylake: LSD disabled. The JCC *erratum* shouldn't be a problem unless the `jcc` touches the last byte of a 32-byte chunk. (To detect that, look for `idq.mite_uops` being a significant fraction of total uops issued, instead of tiny.) To detect that your uops are coming from two separate cache lines, perhaps `idq.all_dsb_cycles_4_uops` vs. `idq.all_dsb_cycles_any_uops` - many cycles with fewer than 4 uops is a problem. (I guess there's buffering between the actual DSB fetch and delivering to the IDQ, since SKL can read all 6 uops in a cycle from a DSB line.) – Peter Cordes Aug 04 '23 at 20:03
  • On Intel Skylake-derived CPUs, it's a well-known fact that tiny loops split across 2 uop-cache lines (because they span a 32B boundary) are limited to 2c / iter for that reason. The hardware mechanism that was supposed to handle that case (the LSD) was disabled by microcode because of an erratum. So it's not surprising that leaves performance potholes. It's a bit surprising for Zen 2, since you'd guess the architects might have included something to avoid being slow on such loops, perhaps a loop buffer. – Peter Cordes Aug 04 '23 at 20:08
  • @PeterCordes `idq.all_dsb_cycles_4_uops` and `idq.all_dsb_cycles_any_uops` are globally equal (~2e9) and the same for the two versions. `idq.mite_uops` is very small compared to `cpu-cycles` in both versions (1000x smaller: 1e6-2e6 -- though the slow version has ~30% higher values). `cpu-cycles` is 3.2e9 for the fast version and 4.0e9 for the slow version. What we can conclude is that there are cycles where the IDQ does not send uops and cycles where everything seems fine. I think we can also say that this is not a JCC problem based on the `idq.mite_uops` value. – Jérôme Richard Aug 05 '23 at 12:46
  • @PeterCordes https://pastebin.com/XR9WZ0AZ shows the results for the `idq_uops_not_delivered.xxx` counters, which are interesting for seeing why no uops are sent from the IDQ. That being said, I find these counters far from simple to understand/interpret... They give me headaches. Apparently, the RAT (and the backend) is stalling on the front-end for 1.2e9 cycles in the fast version. The slow version gives very different values. Note that in both versions the DSB is active during the same number of cycles (~2e9) and there are 5 uops/cycle on average to execute during the reported active cycles of the DSB. – Jérôme Richard Aug 05 '23 at 14:12
  • @PeterCordes I can confirm at least that the LSD is disabled (in both versions). This is sad since it looks like the problem [only appears with hyper-threading](http://gallium.inria.fr/blog/intel-skylake-bug/) and my processor does not even support it, so the LSD should work fine I guess... Regarding Zen 2, I didn't find anything related to the LSD, and the quoted sentences seem to show that there is no such unit on Zen 2. Not to mention, the observed behavior matches pretty well with this hypothesis. – Jérôme Richard Aug 05 '23 at 14:48