
I have a program written in 64-bit NASM with seven "for" loops and one conditional branch from an if statement. Valgrind shows the following branch-miss stats:

==22040== Branches:       23,608,086  (23,606,080 cond +      2,006 ind)
==22040== Mispredicts:     2,307,291  ( 2,306,609 cond +        682 ind)
==22040== Mispred rate:          9.8% (       9.8%     +       34.0%   )

As an experiment, I replaced the conditional jumps in all but the outermost loop, plus the branch from the if statement, with a cmovz followed by an indirect jmp, to avoid conditional jumps.

For example:

add rcx,1
cmp rcx,[rbp-72]
jl Range_17_Loop

is replaced with:

add rcx,1
cmp rcx,[rbp-72]
cmovz r10,r11 ; r10 = exit address when rcx == [rbp-72]
jmp r10       ; indirect jump: loop head or exit

in the following loop:

xor r8,r8

push r10
push r11
mov r10,[jump_17_01] ; move label1 address to r10
mov r11,[jump_17_02] ; move label2 address to r11

Range_17_Loop:

bt rcx,0          ; test bit 0: is rcx odd?
jc return_label_1 ; skip the store for odd values

mov rax,rcx
mul rcx           ; rdx:rax = rcx * rcx
mov rdx,rax       ; keep the low 64 bits of the square

mov [r9+r8],rdx   ; store the square into the output array

add r8,8          ; advance the output index
return_label_1:
add rcx,1

cmp rcx,[rbp-72]
cmovz r10,r11 ; conditional branch replaced with cmovz
jmp r10       ; indirect jump: loop head or exit
;jl Range_17_Loop ; the original conditional branch

Range_17_Loop_Exit:

pop r11
pop r10

This works because NASM lets us take the addresses of the two labels before the outer enclosing loop begins:

mov rax,Range_17_Loop
mov [jump_17_01],rax      ; loop-head address
mov rax,Range_17_Loop_Exit
mov [jump_17_02],rax      ; exit address

so we can jump to either label with the address in a register.
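The same pattern can be sketched in C with the GCC/Clang labels-as-values extension (`&&label`, `goto *ptr`). This is a hypothetical analog, not the original program: the function name and the sum are invented, but the control flow mirrors the loop above, with the backward branch turned into an indirect goto whose target is selected between the loop head and the exit label. (Note the ternary may itself compile to a branch or a cmov, at the compiler's discretion.)

```c
/* Hypothetical C analog of the NASM loop above, using the GCC/Clang
 * labels-as-values extension.  The loop-bottom branch is an indirect
 * jump whose target is selected between the loop head and the exit
 * label, like the cmovz / jmp r10 pair. */
long sum_even_squares(long n)
{
    long total = 0, i = 0;
    void *head = &&loop_head;       /* like mov rax,Range_17_Loop      */
    void *done = &&loop_exit;       /* like mov rax,Range_17_Loop_Exit */
    void *target;

    if (n <= 0)
        return 0;
loop_head:
    if ((i & 1) == 0)               /* same odd-skip as bt rcx,0 / jc  */
        total += i * i;             /* square, like mul rcx            */
    i += 1;
    target = (i < n) ? head : done; /* stands in for the cmovz         */
    goto *target;                   /* stands in for jmp r10           */
loop_exit:
    return total;
}
```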

Because cmov instructions bypass branch prediction (see "Why is a conditional move not vulnerable for Branch Prediction Failure?"), I thought this would reduce the number of branch mispredicts. But Valgrind shows a much higher rate of branch misprediction after the change to cmovz:

==22180== Branches:       23,608,122  ( 9,846,969 cond + 13,761,153 ind)
==22180== Mispredicts:     6,267,691  (   336,434 cond +  5,931,257 ind)
==22180== Mispred rate:         26.5% (       3.4%     +       43.1%   )

So the mispredict rate has gone from 9.8% to 26.5%, and execution is about 8% slower than before.

The total number of branches executed barely changes (23,608,086 vs. 23,608,122), and the conditional count drops by roughly the amount the indirect count rises, so the `jmp r10` evidently counts as an indirect branch. But why is the mispredict rate so much higher if cmov bypasses the branch prediction unit?
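The counts suggest the reclassification itself is the problem: each replaced jcc became an indirect branch, which the simulator's indirect predictor handles worse. Where cmov does pay off is when the condition selects *data* rather than a jump target. A hedged C sketch (the function name and data are invented); compilers typically lower the ternary here to cmov or setcc at -O2, so an unpredictable comparison costs no mispredicts, while the loop branch stays an ordinary, well-predicted jcc:

```c
#include <stddef.h>

/* Branchless data selection: the kind of pattern cmov is meant for.
 * The comparison result feeds a value, not a branch target, so an
 * unpredictable condition causes no branch mispredicts. */
long sum_if_below(const long *v, size_t n, long limit)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (v[i] < limit) ? v[i] : 0;  /* select data, not control */
    return sum;
}
```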

The entire assembly listing is almost 600 lines, so I haven't posted it here. The block above shows the issue, but I will post the whole listing on request.

Thanks.

  • CMOV is a branchless select, but the `jmp r10` is *harder* to predict than a normal conditional branch that only has 2 possible destinations (fall through or taken). At least for valgrind's simulator; real CPUs (like Haswell and later with IT-TAGE predictors) may do at best equal to a normal conditional branch. – Peter Cordes Aug 25 '21 at 19:34
  • I suspected that. Cmov is often used to substitute one value for another in a math operation, but using it for branching seems -- from this experience -- not as predictable, like you said. – RTC222 Aug 25 '21 at 19:36
  • 2
    The mispredictions have basically nothing to do with cmov, everything to do with the pattern of indirect branch targets, however they're generated. The only way I could see this making sense is if there was some code-reuse with one loop where you could plug in different targets, although most likely you'd be best with a `call r10` function pointer to a function that returns with a normal `ret` to plug a different block of code into a loop. – Peter Cordes Aug 25 '21 at 19:39
  • 1
    Basically, there's no magic bullet. If this could be a win, CPUs would already be doing it internally in the implementation of `jl` – Peter Cordes Aug 25 '21 at 19:40
  • The call-return is interesting, but what about the overhead of the call? – RTC222 Aug 25 '21 at 19:42
  • 1
    That's the price you pay for being able to plug in a different block into the same loop, instead of making a separate version of the loop for each different thing you might want to inline into it. (Or JITing to do the plugging in on the fly, could be worth it for long-running loops.) call is only 2 uops vs. 1 for `jmp reg`, and `ret` is only 1, not a huge deal. (https://uops.info/ reports suspiciously bad throughput for `ret` on SKL, perhaps testing *just* ret, not call/ret, and maybe in a big sequence underflowing the RAS predictor. – Peter Cordes Aug 25 '21 at 19:50
  • 1
    I suspect call/ret would be cheaper than setting up two jump-target registers, one for the jump out of the loop and one back into it. Obviously this wouldn't be to a function that follows the ABI, so you'd just be using registers like it was part of the same function. If there's only one loop you ever want to plug blocks into, though, I guess you could use `jmp reg` to go to the right block, but have each block use `jmp rel32` to get back; direct jumps are easier for the front-end to handle, taking less time to resteer if they weren't predicted before decode. – Peter Cordes Aug 25 '21 at 19:52
  • It's worth a try, as an experiment, but I wonder how much improvement I would get. I would still need to "call" conditionally which I think would involve prediction. – RTC222 Aug 25 '21 at 19:55
  • I'm not talking about replacing the loop branch like you're doing here, I'm talking about a generic loop that always needs some extra filler. Hmm, I guess having each possible block end with an indirect branch would mean you could chain them together, but only if each block referenced a separate static storage location for its target pointer. Obviously best to just inline, but if that's prohibitive for code-size it's worth considering other options. (e.g. N different loops, each of which can use M different helper blocks, then call-reg/ret is nice). – Peter Cordes Aug 25 '21 at 20:24
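The call-reg/ret idea from the comments — one shared loop with the per-iteration block plugged in through an indirect call that returns with a normal ret — can be sketched in C as below. This is a hypothetical illustration (all names invented), not code from the question: the indirect call gets one stable target per loop instance, which is the easy case for the BTB, and the ret is predicted by the return-address stack.

```c
/* Hypothetical sketch of the call-reg/ret pattern: one generic loop,
 * with the per-iteration block selected via a function pointer.
 * The indirect call is like "call r10"; each body returns with ret. */
typedef long (*body_fn)(long);

long run_loop(long n, body_fn body)
{
    long acc = 0;
    for (long i = 0; i < n; i++)
        acc += body(i);     /* call r10 ... ret */
    return acc;
}

/* Two interchangeable blocks to plug into the same loop. */
long square_body(long x) { return x * x; }
long double_body(long x) { return 2 * x; }
```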

0 Answers