
I'm trying to understand why some simple loops run at the speeds they do.

First case:

L1:
    add rax, rcx  # (1)
    add rcx, 1    # (2)
    cmp rcx, 4096 # (3)
    jl L1

According to IACA, the throughput is 1 cycle and the bottleneck is ports 0, 1 and 5. I don't understand why it is 1 cycle. After all, we have two loop-carried dependencies:

(1) -> (1) (latency is 1)
(2) -> (2), (2) -> (1), (2) -> (3) (latency is 1 + 1 + 1).

And this latency is loop-carried, so it should make each iteration slower.
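
To show what I mean, here is my own sketch of two back-to-back iterations, with the dependencies I counted written as comments (numbering as above):

L1:
    add rax, rcx  # (1) iter N:   needs rax from iter N-1 and rcx from iter N-1
    add rcx, 1    # (2) iter N:   needs rcx from iter N-1
    cmp rcx, 4096 # (3) iter N:   needs the rcx just written by (2)
    jl L1
    add rax, rcx  # (1) iter N+1: needs rax and rcx from iter N
    add rcx, 1    # (2) iter N+1: needs rcx from iter N
    cmp rcx, 4096
    jl L1

Yet IACA reports: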

Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles       Throughput Bottleneck: Port0, Port1, Port5

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 1.0  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.0  |
-------------------------------------------------------------------------


| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 1.0       |     |           |           |     |     | CP | add rax, rcx
|   1    |           | 1.0 |           |           |     |     | CP | add rcx, 0x1
|   1    |           |     |           |           |     | 1.0 | CP | cmp rcx, 0x1000
|   0F   |           |     |           |           |     |     |    | jl 0xfffffffffffffff2
Total Num Of Uops: 3

Second case:

L1:    
    add rax, rcx
    add rcx, 1
    add rbx, rcx
    cmp rcx, 4096
    jl L1

Block Throughput: 1.65 Cycles       Throughput Bottleneck: InterIteration

Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |
-------------------------------------------------------------------------
| Cycles | 1.4    0.0  | 1.4  | 0.0    0.0  | 0.0    0.0  | 0.0  | 1.3  |
-------------------------------------------------------------------------


| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |    |
---------------------------------------------------------------------
|   1    | 0.6       | 0.3 |           |           |     |     |    | add rax, rcx
|   1    | 0.3       | 0.6 |           |           |     |     | CP | add rcx, 0x1
|   1    | 0.3       | 0.3 |           |           |     | 0.3 | CP | add rbx, rcx
|   1    |           |     |           |           |     | 1.0 | CP | cmp rcx, 0x1000
|   0F   |           |     |           |           |     |     |    | jl 0xffffffffffffffef

What I understand even less is why the throughput here is 1.65.

  • did you try running it, and measuring the instructions-per-cycle? (Change 4096 to something gigantic). I haven't finished analysing it yet, but your latency of 1+1+1 is obviously wrong: `cmp` is only part of a dependency chain when a later insn reads the flags. – Peter Cordes Apr 25 '16 at 12:51
  • IACA is *not* cycle-accurate. It's always an approximation, except in easy cases where the bottleneck is simple. Its 1.65 is pretty close to the observed number. Oh I just noticed that 1.65 is from a different loop with an extra `add`. – Peter Cordes Apr 25 '16 at 13:02
  • @PeterCordes I edited. – Gilgamesz Apr 25 '16 at 13:11

1 Answer


In the first loop, there are two dep chains, one for rax and one for rcx.

add rax, rcx  # depends on rax and rcx from the previous iteration, produces rax for the next iteration

add rcx, 1    # latency = 1

The 2-cycle latency dep chain of add rcx,1 -> add rax, rcx spans 2 iterations (so it already has time to happen) and it's not even loop-carried anyway (because rax doesn't feed back into the add rcx,1).

In any given iteration, only the previous iteration's results are needed to produce this iteration's results. Dependencies between instructions within one iteration aren't loop-carried; only the chains that feed a value from one iteration into the next are.
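
To put it another way, here is a sketch (my illustration, not IACA output) of the only two chains that actually carry a value from one iteration into the next:

    ; loop-carried chain 1: the counter, 1 cycle per iteration
    add rcx, 1        ; iteration N's rcx -> iteration N+1's add rcx, 1

    ; loop-carried chain 2: the accumulator, also 1 cycle per iteration
    add rax, rcx      ; iteration N's rax -> iteration N+1's add rax, rcx
                      ; its rcx input comes from the *previous* iteration's add rcx, 1,
                      ; which is ready exactly when this add needs it, so nothing stalls

Both chains advance by only 1 cycle per iteration and neither one feeds back into the other, so one iteration per cycle is achievable as long as the front-end and the three ALU ports keep up.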

Like I explained in answer to your question a couple days ago, the cmp/jcc is not part of the loop-carried dep chain.

cmp is only part of a dep chain if a cmov or setcc reads the flag output it generates. Control dependencies are predicted, not waited for like data dependencies.
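
For contrast, a hypothetical variant (my example, not part of either of your loops) where the flags do become a data dependency:

    add rcx, 1
    cmp rcx, 4096
    cmovl rax, rdx     ; cmov has to wait for the flags, so cmp joins the rax data dep chain
    jl L1              ; the branch itself is only predicted; it never stalls the chain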

In practice, on my E6600 (first-gen Core2, I don't have a SnB available at the moment):

; Linux initializes most registers to zero on process startup, and I'm lazy so I depended on this for this one-off test.  In real code, I'd xor-zero ecx
    global _start
_start:
L1:
    add eax, ecx        ; (1)
    add ecx, 1          ; (2)
    cmp ecx, 0x80000000 ; (3)
    jb L1            ; can fuse with cmp on Core2 (in 32bit mode)

    mov eax, 1
    int 0x80

I ported it to 32bit since Core2 can only macro-fuse in 32bit mode, and used a jb since Core2 can only macro-fuse unsigned branch conditions. I used a large loop counter so I didn't need another loop outside this. (IDK why you picked a tiny loop count like 4096. Are you sure you weren't measuring extra overhead from something else outside your short loop?)

$ yasm -Worphan-labels -gdwarf2 -felf tinyloop.asm && ld -m elf_i386 -o tinyloop tinyloop.o
$ perf stat -e task-clock,cycles,instructions,branches ./tinyloop

Performance counter stats for './tinyloop':

    897.994122      task-clock (msec)         #    0.993 CPUs utilized          
 2,152,571,449      cycles                    #    2.397 GHz                    
 8,591,925,034      instructions              #    3.99  insns per cycle        
 2,147,844,593      branches                  # 2391.825 M/sec                  

   0.904020721 seconds time elapsed

So it runs at 3.99 insns per cycle, which means ~one iteration per cycle.
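
(Working it out from the counters above: 8,591,925,034 insns / 2,152,571,449 cycles ≈ 3.99 insns per cycle, and the loop body is 4 insns, so 4 / 3.99 ≈ 1.0 cycles per iteration; or directly, 2,152,571,449 cycles / 2,147,844,593 branches ≈ 1.0 cycles per iteration.)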

I'd be surprised if your IvyBridge ran that exact code only about half as fast. Update: per discussion in chat, yes, it appears IvB really does only get 2.14 IPC (one iteration per 1.87c). Changing the add rax, rcx to add rax, rbx or something to remove the dependency on the loop counter from the previous iteration brings the throughput up to 3.8 IPC (one iteration per 1.05c). I don't understand why this happens.
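
(For reference, that faster variant is just the following; this is my reconstruction of the change described above. rbx is never written inside the loop, so the accumulator add no longer reads anything produced by the previous iteration's counter update:)

L1:
    add rax, rbx      ; no longer reads rcx, so it's independent of the counter chain
    add rcx, 1
    cmp rcx, 4096
    jl L1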

With a similar loop that doesn't depend on macro-fusion (add / inc ecx / jnz), I also get one iteration per 1c (2.99 insns per cycle).

However, having a 4th insn in the loop that also reads ecx makes it massively slower. Core2 can issue 4 uops per clock, even though (like SnB/IvB) it only has three ALU ports. (A lot of code includes memory uops, so this does make sense.)

L1:
    add eax, ecx       ; changing this to add eax,ebx  helps when there are 4 non-fusing insns in the loop
    ; add edx, ecx     ; slows us down to 1.34 IPC, or one iter per 3c
    ; add edx, ebx     ; only slows us to 2.28 IPC, or one iter per 1.75c
                       ; with neither:    3    IPC, or one iter per 1c
    inc ecx
    jnz L1             ; loops 2^32 times, doesn't macro-fuse on Core2

I expected it to still run at 3 IPC, or one iter per 4/3 = 1.333c. However, pre-SnB CPUs have many more bottlenecks, like ROB-read and register-read bottlenecks. SnB's switch to a physical register file eliminated those slowdowns.


In your 2nd loop, IDK why it doesn't run at one iteration per 1.333c. The insn updating rbx can't run until after the other instructions from that iteration, but that's what out-of-order execution is for. Are you sure it's as slow as one iteration per 1.85 cycles? You measured with perf for a high enough count to get meaningful data? (rdtsc cycle counting isn't accurate unless you disable turbo and frequency scaling, but perf counters still count actual core cycles).
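
If you want to check it directly, here is a sketch of a standalone 64-bit version of your second loop, along the same lines as my 32-bit test above (assemble with yasm -felf64 and plain ld; the file name and the 10^9 iteration count are my choices, picked so the loop dominates the run time):

    global _start
_start:
    xor ecx, ecx            ; make the starting value explicit instead of relying on Linux zeroing
L1:
    add rax, rcx
    add rcx, 1
    add rbx, rcx
    cmp rcx, 1000000000
    jl L1

    mov eax, 60             ; SYS_exit on x86-64
    xor edi, edi
    syscall

Then perf stat's cycle count divided by 1,000,000,000 is cycles per iteration directly, and the insns-per-cycle line should come out near 5 / (cycles per iteration), since the loop body is 5 insns.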

I wouldn't expect it to be much different from

L1:    
    add rax, rcx
    add rbx, rcx      # before/after inc rcx shouldn't matter because of out-of-order execution
    add rcx, 1
    cmp rcx, 4096
    jl L1
  • Thanks :). In a comment on your answer here: http://stackoverflow.com/questions/36739118/dependency-chain-analysis/36758940#36758940 you said that there is a loop-carried dependency (I mean: "add is the loop-carried dependency chain") and now you say that there is none. – Gilgamesz Apr 25 '16 at 15:02
  • @Gilgamesz: My wording was confusing, sorry. I meant that `cmp`'s dependency on `add` exists, but isn't part of a loop-carried dependency chain. There are of course loop-carried dependency chains, but the dependencies within each iteration aren't part of them. – Peter Cordes Apr 25 '16 at 15:34
  • For my first loop I got 1.9 cycles per iteration (perf and Agner Fog's tool), but I cannot understand why it's 1.9. We have a 2-cycle-latency loop-carried chain: `add rax, rcx # depends on rax and rcx from the previous iteration, produces rax for the next iteration` – Gilgamesz Apr 25 '16 at 16:03
  • Are you sure your numbers are cycles per iteration, not **instructions per cycle**? `perf` can't directly measure cycles per iter. But anyway, *no*, there is not a 2-cycle latency dependency chain between one iteration and the next. The 2-cycle latency dep chain of `inc rcx` -> `add rax, rcx` is across 2 iterations (so it already has time to happen) and it's not loop-carried (because `rax` doesn't feed back into the `inc rcx`). – Peter Cordes Apr 25 '16 at 16:28
  • I got from `perf`: `192380778 cycles # 3.207 GHz` and I divided it by the number of iterations, 100000000. Perhaps I misunderstand something, but I am making progress thanks to you ;) – Gilgamesz Apr 25 '16 at 16:35
  • @Gilgamesz: Yup, you calculated correctly, but it's probably best to look at insns per cycle from `perf` and then divide the number of insns in the loop by that. If there's any sampling noise, it will likely be similar for cycles and insns. I added some test results for Core2. – Peter Cordes Apr 25 '16 at 17:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/110179/discussion-between-gilgamesz-and-peter-cordes). – Gilgamesz Apr 25 '16 at 17:46