I read in an online resource that Ivy Bridge has 3 ALUs, so I wrote a small program to test this:
    global _start
    _start:
        mov rcx, 10000000
    .for_loop:              ; do {
        inc rax
        inc rbx
        dec rcx
        jnz .for_loop       ; } while (--rcx)
        xor rdi, rdi
        mov rax, 60         ; _exit(0)
        syscall
I compile it and run it under perf:
$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out
The output shows:
10,491,664 cycles
which seems to make sense at first glance, because the loop contains 3 independent instructions (2 inc and 1 dec) that use an ALU, so together they take 1 cycle.
But what I don't understand is why the whole loop takes only 1 cycle. jnz depends on the result of dec rcx, so it should take 1 more cycle, making the whole loop 2 cycles. I would expect the output to be close to 20,000,000 cycles.
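To make the mismatch concrete, here is the rough per-iteration arithmetic. It ignores the few instructions outside the loop and any process startup work that perf also counts, so treat the figures as approximate:

```latex
\frac{10{,}491{,}664\ \text{cycles}}{10{,}000{,}000\ \text{iterations}} \approx 1.05\ \text{cycles per iteration (measured)}
\qquad\text{vs.}\qquad
2 \times 10{,}000{,}000 = 20{,}000{,}000\ \text{cycles (expected)}
```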
I also tried changing the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does become close to 20,000,000 cycles, which shows that a dependency delays an instruction so that the two can't execute in the same cycle. So why is jnz special?
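For reference, a sketch of the modified program from this second experiment, assuming nothing else changes. Only the second inc differs from the original:

```nasm
global _start
_start:
    mov rcx, 10000000
.for_loop:              ; do {
    inc rax
    inc rax             ; was "inc rbx"; now depends on the previous inc rax
    dec rcx
    jnz .for_loop       ; } while (--rcx)
    xor rdi, rdi
    mov rax, 60         ; _exit(0)
    syscall
```

It is built and measured with the same nasm, ld and perf stat command as above.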
What am I missing here?