
From Agner Fog's "Optimizing Assembly" guide, Section 12.7: a loop example. One of the paragraphs discussing the example code:

[...] Analysis for Pentium M: ... 13 uops at 3 per clock = one iteration per 4.33c retirement time.

There is a dependency chain in the loop. The latencies are: 2 for memory read, 5 for multiplication, 3 for subtraction, and 3 for memory write, which totals 13 clock cycles. This is three times as much as the retirement time but it is not a loop-carried dependence because the results from each iteration are saved to memory and not reused in the next iteration. The out-of-order execution mechanism and pipelining makes it possible that each calculation can start before the preceding calculation is finished. The only loop-carried dependency chain is add eax,16 which has a latency of only 1.
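The arithmetic in the two quoted paragraphs can be checked directly. A small sketch (the uop count, retirement rate, and per-instruction latencies are the figures quoted from the guide):

```python
# Figures quoted from Agner Fog's Pentium M analysis.
uops_per_iteration = 13
retire_per_clock = 3                    # uops retired per clock
retire_time = uops_per_iteration / retire_per_clock
print(round(retire_time, 2))            # one iteration per ~4.33 clocks

# Per-iteration dependency chain: load -> mulpd -> subpd -> store
chain_latency = 2 + 5 + 3 + 3
print(chain_latency)                    # 13 clocks, ~3x the retirement time
```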

## Example 12.6b.  DAXPY algorithm, 32-bit mode
[...]   ; not shown: initialize some regs before the loop
L1:
    movapd xmm1, [esi+eax]   ; X[i], X[i+1]
    mulpd  xmm1, xmm2        ; X[i] * DA, X[i+1] * DA
    movapd xmm0, [edi+eax]   ; Y[i], Y[i+1]
    subpd  xmm0, xmm1        ; Y[i]-X[i]*DA, Y[i+1]-X[i+1]*DA
    movapd [edi+eax], xmm0   ; Store result
    add eax, 16              ; Add size of two elements to index
    cmp eax, ecx             ; Compare with n*8
    jl L1                    ; Loop back
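In higher-level terms, the loop above computes the DAXPY update `Y[i] = Y[i] - DA*X[i]`, handling two doubles per iteration (one 16-byte `movapd`). A plain-Python sketch of the same computation (scalar, one element at a time; the function name is mine, not from the guide):

```python
def daxpy(y, x, da):
    """Y[i] -= DA * X[i] for all i -- the scalar equivalent of the SSE2
    loop, which does two doubles per iteration with mulpd/subpd."""
    for i in range(len(y)):
        y[i] = y[i] - da * x[i]
    return y

print(daxpy([10.0, 20.0, 30.0], [1.0, 2.0, 3.0], 2.0))  # [8.0, 16.0, 24.0]
```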

I cannot understand why the dependency chain doesn't increase a whole throughput. I know that it is only important to find the worst bottleneck. The worst bottleneck identified before considering dependency chains was fused-domain uop throughput, at 4.33 cycles per iteration. I cannot understand why the dependency chain isn't a bigger bottleneck than that.

  1. I see that the author connects this with out-of-order execution and pipelining, but I cannot see how. I mean that only the multiplication has a latency of 5 cycles, so only that value is greater than the 4.33-cycle retirement time.

  2. I also cannot understand why the author doesn't care about the dependency here: `add eax, 16` -> `cmp eax, ecx` -> `jl L1`. After all, the addition must be executed before the cmp, and the cmp must be executed before the jl.


PS: later paragraphs identify the biggest bottleneck for Pentium M as decode, limiting it to one iteration per 6c, because 128b vector ops decode to two uops each. See Agner Fog's guide for the rest of the analysis, and analysis + tuning for Core2, FMA4 Bulldozer, and Sandybridge.

Peter Cordes
Gilgamesz
  • The compare/branch pair would be predicted so it doesn't really count. Apart from that I'm not sure what you're asking – harold Apr 20 '16 at 09:27
  • Can you please link Agner's document and state what section and example you are referencing? – Z boson Apr 20 '16 at 09:49

1 Answer

  1. The mulpd isn't part of a loop-carried dependency chain, so there can be mulpd insns from multiple iterations in flight at once. The latency of a single instruction isn't the issue here at all, it's the dependency chain. Each iteration has a separate 13c dependency chain of load, mulpd, subpd, store. Out-of-order execution is what allows uops from multiple iterations to be in flight at once.

  2. The cmp / jl in each iteration depend on the add from that iteration, but the add in the next iteration doesn't depend on the cmp. Speculative execution and branch prediction mean that control dependencies (conditional branches and indirect jumps/calls) are not part of data dependency chains. This is why instructions from one iteration can start running before the jl from the preceding iteration retires.

    By comparison, cmov is a data dependency instead of a control dependency, so branchless loops tend to have loop-carried dependency chains. This tends to be slower than branching if the branch predicts well.

    Each loop iteration has a separate cmp/jl dependency chain, just like the FP dependency chain.
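A toy timing model makes both points concrete. It assumes (numbers taken from the question and the quoted analysis; this is a sketch, not a real pipeline simulation) that the front end can start a new iteration every 13/3 ≈ 4.33 clocks, and that each iteration's load -> mulpd -> subpd -> store chain takes 13 clocks from start to finish:

```python
issue_interval = 13 / 3   # fused-domain uop throughput: new iteration every ~4.33c
chain_latency = 13        # per-iteration load -> mulpd -> subpd -> store chain

# Point 1: with out-of-order execution, iteration i's chain can start at
# i*issue_interval, overlapping with ~3 earlier chains still in flight,
# so per-iteration cost approaches the issue interval, not the latency.
n = 1000
finish_last = (n - 1) * issue_interval + chain_latency
print(round(finish_last / n, 2))          # ~4.34, approaching 4.33 -- not 13

# Point 2: the only loop-carried chain is add -> add, with 1c latency,
# far below the issue limit, so it never becomes the bottleneck.
loop_carried = 1
print(round(max(loop_carried, issue_interval), 2))  # 4.33
```

The key structural fact the model encodes is that `chain_latency` appears only once (for the last iteration's tail), while `issue_interval` is paid per iteration.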


I cannot understand why dependency chain doesn't increase a whole throughput.

I have no idea what this sentence means. I think I was able to figure out all your other mixed up words and phrasing. (e.g. "chain dependency" instead of "dependency chain".) Have a look at my edits to your question; some of them might help your understanding, too.

Peter Cordes
  • Thanks :). To your first point: OK, it is clear that multiple iterations can be in flight at once. But when it comes to only ONE particular iteration, why doesn't the latency of mulpd (5 cycles) matter? After all, `subpd xmm0, xmm1` must wait for `mulpd xmm1, xmm2` (in the one dependency chain of one iteration). Sorry for my English, I know that it can be problematic. 2. Agner Fog says that `add eax, 16` is loop-carried and it costs 1 cycle (latency). – Gilgamesz Apr 22 '16 at 17:50
  • @Gilgamesz: 2. That's correct. `add` -> `add` is the loop-carried dependency chain, not `add -> cmp -> jl -> add`. – Peter Cordes Apr 23 '16 at 00:48
  • re: first point: Can you be more specific about why you think it *does* matter? We're calculating throughput, *not* the latency of a single iteration. As long as the out-of-order insn scheduler and ReOrder Buffer are big enough to expose the parallelism between iterations, latency of the dep chain within an iteration is irrelevant. (A really long dep chain would require a big scheduler and ROB). Latency of any specific instructions in that dep chain is even less relevant. – Peter Cordes Apr 23 '16 at 00:49
  • OK @Peter Cordes, it makes sense. Thank you very much. :) – Gilgamesz Apr 23 '16 at 09:21