I'm trying to understand why some simple loops run at the speeds they do.
First case:
L1:
add rax, rcx # (1)
add rcx, 1 # (2)
cmp rcx, 4096 # (3)
jl L1
According to IACA, the throughput is 1 cycle per iteration and the bottleneck is ports 0, 1, and 5. I don't understand why it is 1 cycle. After all, we have two loop-carried dependencies:
(1) -> (1) (latency is 1)
(2) -> (2), (2) -> (1), (2) -> (3) (latency is 1 + 1 + 1).
This latency is loop-carried, so it should slow down each iteration.
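A minimal sketch for checking this reasoning, under stated assumptions: every instruction has a latency of 1 cycle, there are unlimited execution ports, and the branch is predicted perfectly. The function `latency_bound` and the tuple encoding of the loop are hypothetical helpers for illustration only, not anything IACA provides. It tracks when each register's value becomes ready and reports the average number of cycles per iteration that the data dependencies alone impose.

```python
def latency_bound(instructions, iterations=1000):
    """Cycles per iteration when only data dependencies limit execution.

    instructions: list of (dest, sources) pairs in program order.
    Assumption of this sketch: every instruction has a latency of 1 cycle.
    """
    ready = {}  # register name -> cycle at which its latest value is ready
    for _ in range(iterations):
        for dest, sources in instructions:
            # An instruction can start once all of its inputs are ready.
            start = max((ready.get(s, 0) for s in sources), default=0)
            if dest is not None:
                ready[dest] = start + 1
    return max(ready.values()) / iterations


# The first loop, written as (destination, sources) pairs.
first_loop = [
    ("rax", ["rax", "rcx"]),   # (1) add rax, rcx
    ("rcx", ["rcx"]),          # (2) add rcx, 1
    ("flags", ["rcx"]),        # (3) cmp rcx, 4096
    (None, ["flags"]),         # jl L1 (assumed perfectly predicted)
]

print(latency_bound(first_loop))
```

In this toy model the result comes out close to 1 cycle per iteration, because the `rax` chain and the `rcx` chain each advance by a single 1-cycle add per iteration.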
Throughput Analysis Report
--------------------------
Block Throughput: 1.00 Cycles Throughput Bottleneck: Port0, Port1, Port5
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 1.0 0.0 | 1.0 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.0 |
-------------------------------------------------------------------------
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | 1.0 | | | | | | CP | add rax, rcx
| 1 | | 1.0 | | | | | CP | add rcx, 0x1
| 1 | | | | | | 1.0 | CP | cmp rcx, 0x1000
| 0F | | | | | | | | jl 0xfffffffffffffff2
Total Num Of Uops: 3
Second case:
L1:
add rax, rcx
add rcx, 1
add rbx, rcx
cmp rcx, 4096
jl L1
Block Throughput: 1.65 Cycles Throughput Bottleneck: InterIteration
Port Binding In Cycles Per Iteration:
-------------------------------------------------------------------------
| Port | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 |
-------------------------------------------------------------------------
| Cycles | 1.4 0.0 | 1.4 | 0.0 0.0 | 0.0 0.0 | 0.0 | 1.3 |
| Num Of | Ports pressure in cycles | |
| Uops | 0 - DV | 1 | 2 - D | 3 - D | 4 | 5 | |
---------------------------------------------------------------------
| 1 | 0.6 | 0.3 | | | | | | add rax, rcx
| 1 | 0.3 | 0.6 | | | | | CP | add rcx, 0x1
| 1 | 0.3 | 0.3 | | | | 0.3 | CP | add rbx, rcx
| 1 | | | | | | 1.0 | CP | cmp rcx, 0x1000
| 0F | | | | | | | | jl 0xffffffffffffffef
Here I understand even less why the throughput is 1.65 cycles.
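For comparison, here is a back-of-the-envelope port-pressure bound. It assumes, based on the port columns in the IACA tables above, that each `add` can execute on port 0, 1, or 5, and that the macro-fused `cmp`/`jl` pair executes only on port 5. The helper `port_bound` is a hypothetical illustration and does not reproduce IACA's scheduler. It takes the worst ratio of "uops forced onto a set of ports" to "number of ports in that set".

```python
from itertools import combinations


def port_bound(uops):
    """Lower bound on cycles per iteration from port pressure alone.

    uops: list of sets, each holding the ports one uop may execute on.
    """
    ports = sorted(set().union(*uops))
    best = 0.0
    for size in range(1, len(ports) + 1):
        for subset in combinations(ports, size):
            allowed = set(subset)
            # Count the uops that can only run on ports inside this subset.
            forced = sum(1 for u in uops if u <= allowed)
            best = max(best, forced / size)
    return best


ALU = {0, 1, 5}   # assumed: ports available to the adds
BRANCH = {5}      # assumed: the fused cmp/jl uses port 5 only

first_loop = [ALU, ALU, BRANCH]
second_loop = [ALU, ALU, ALU, BRANCH]

print(port_bound(first_loop))
print(port_bound(second_loop))
```

Under these assumptions the bound is 1 cycle for the first loop (3 uops over 3 ports) and 4/3, roughly 1.33 cycles, for the second (4 uops over 3 ports). Neither figure accounts for the 1.65 cycles that IACA reports with the InterIteration bottleneck.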