2

I found in online resource that IvyBridge has 3 ALU. So I write a small program to test:

global _start
_start:
    mov rcx,    10000000
.for_loop:              ; do {
    inc rax
    inc rbx
    dec rcx
    jnz .for_loop       ; } while (--rcx)

    xor rdi,    rdi
    mov rax,    60      ; _exit(0)
    syscall

I compile and run it with perf:

$ nasm -felf64 cycle.asm && ld cycle.o && sudo perf stat ./a.out

The output shows:

10,491,664      cycles

which seems to make sense at the first glance, because there are 3 independent instructions (2 inc and 1 dec) that uses ALU in the loop, so they count 1 cycle together.

But what I don't understand is why the whole loop only has 1 cycle? jnz depends on the result of dec rcx, it should counts 1 cycle, so that the whole loop is 2 cycle. I would expect the output to be close to 20,000,000 cycles.

I also tried to change the second inc from inc rbx to inc rax, which makes it dependent on the first inc. The result does becomes close to 20,000,000 cycles, which shows that dependency will delay an instruction so that they can't run at the same time. So why jnz is special?

What I'm missing here?

user10865622
  • 455
  • 3
  • 11
  • 1
    Probably something related to [instruction fusing](https://en.wikichip.org/wiki/macro-operation_fusion). – David Wohlferd Jan 04 '19 at 05:31
  • 1
    [CMP/TEST/ADD/SUB/INC/DEC/AND can fuse with Jcc](https://en.wikichip.org/wiki/macro-operation_fusion) into a single macro operation – phuclv Jan 04 '19 at 06:22
  • You might want to play with iaca (discussed [here](https://stackoverflow.com/a/26021338/2189500)). It lets you visualize some of this stuff. – David Wohlferd Jan 04 '19 at 06:40
  • note that [`xor edi, edi` would be better than `xor rdi, rdi`](https://stackoverflow.com/q/33666617/995714) since it's shorter – phuclv Jan 07 '19 at 01:55

1 Answers1

3

First of all, dec/jnz will macro-fuse into a single uop on Intel Sandybridge-family. You could defeat that by putting a non-flag-setting instruction between the dec and jnz.

.for_loop:              ; do {
    inc rax
    dec rcx
    lea rbx, [rbx+1]    ; doesn't touch flags, defeats macro-fusion
    jnz .for_loop       ; } while (--rcx)

This will still run at 1 iter per cycle on Haswell and later and Ryzen because they have 4 integer execution ports to keep up with 4 uops per iteration. (Your loop with macro-fusion is only 3 fused-domain uops on Intel CPUs, so SnB/IvB can run it at 1 per clock, too.)

See Agner Fog's optimization guide and especially his microarch guide. Also other links in https://stackoverflow.com/tags/x86/info.


Control dependencies are hidden by branch prediction + speculative execution, unlike data dependencies.

Out-of-order execution and branch prediction + speculative execution hide the "latency" of the control dependency. i.e. the next iteration can start running before the CPU verifies that jnz should really be taken.

So each jnz has an input dependency on the previous dec rcx before it can verify the prediction, but later instructions don't have to wait for it to be checked before they can execute. In-order retirement makes sure that mis-speculation is caught before anything can "see" it happen (except for microarchitectural effects leading to the Spectre attack...)


10M iterations is not a lot. I'd normally use at least 100M for something that runs at only 1c per iter. Having a simple microbenchmark run for 0.1 to 1 second is normally good to get very high precision and hide startup overhead.

And BTW, you don't need sudo perf if you set kernel.perf_event_paranoid = 0 with sysctl. It's almost certainly better to do that than to use sudo all the time.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • I've already convinced by the commenters that the cause is the macro-fusion. But your answer confuse me a lot, since you seem to say the `jnz` will be executed in the same cycle even without macro-fusion. I've tested by inserting `mov rsi, rdi` before `jnz` to prevent fusion, the result becomes 2cycle per iter. But if I insert `mov rsi, rdi` before `dec rcx`, the result is 1cycle per iter. So it seems the macro-fusion is the cause. – user10865622 Jan 04 '19 at 09:31
  • If branch prediction will execute `jnz` regardless of fusion, then how to explain this result? – user10865622 Jan 04 '19 at 09:31
  • I think I understand after reading some sections in Anger's book. When I insert `mov rsi, rdi` before `jnz`, the macro-fusion is prevented. So I end up with 3 uops for ALU, but `jnz` is also a uop for branch, which is at port 5 in IvyBridge and port 5 is an ALU as well. So these 4 uops can't be executed at the same time. Do I understand correctly? (though I still don't know why it becomes 2c/iter, perhaps I should ask a new question for that) – user10865622 Jan 04 '19 at 12:53
  • 1
    @user10865622 `jnz` will be executed in the same cycle even without macro-fusion *on Haswell and later and Ryzen* because there are more ports on these microarchitectures. The first three instructions together with `jnz` of the *previous* iteration can all be dispatched in the same cycle and complete execution in one cycle. – Hadi Brais Jan 04 '19 at 15:07
  • 1
    @user10865622: Adding another uop *and* preventing micro-fusion gives you 5 total fused-domain uops. If you're on IvB, then you have a front-end bottleneck and it's a duplicate of [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/a/53148682). That means your loop can only issue at one iteration per 2 clocks, because SnB/IvB don't "unroll" loops inside the loop buffer: the taken loop branch has to be the last uop of an issue group (of up-to-4 uops). – Peter Cordes Jan 04 '19 at 17:44
  • @PeterCordes I haven't thought about front-end being bottleneck, now I can explain the cycle count both for my test and the `lea` example in your answer. Thanks a lot! A quick question: does the "taken branch" mean the branch instruction like `jnz`? I've encountered it many times, though I couldn't find the definition of "taken branch" from Google. – user10865622 Jan 05 '19 at 00:28
  • @user10865622: it means a branch that does jump, rather than falling through. e.g. `jnz` if it runs when ZF is cleared. Unconditional branches like `jmp` and `call` are always taken, but conditional branches (`jcc`) can fall through. Fun fact: haswell and later have a branch execution unit on port 0 that can only handle predicted-not-taken branch and macro-fused-branch uops, as well as the regular branch execution unit on port 6 that can handle any branch. – Peter Cordes Jan 05 '19 at 00:31