- I cannot understand why the first code has ~1 cycle per iteration and second has 2 cycle per iteration. I measured with Agner's tool and perf. According to IACA it should take 1 cycle, from my theoretical computations too.
This takes 1 cycle per iteration.
; array is array defined in section data
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
add rcx, 1
cmp rcx, n
jle .begin
And this takes 2 cycles per iteration. but why?
; array is array defined in section data
%define n 1000000
xor rcx, rcx
.begin:
movnti [array], eax
nop
add rcx, 1
cmp rcx, n
jle .begin
This final version takes ~27 cycles per iteration. But why? After all, there is no dependency chain.
.begin:
movnti [array], eax
mov rbx, [array+16]
add rcx, 1
cmp rcx, n
jle .begin
My CPU is IvyBridge.