I have a program written in 64-bit NASM with seven "for" loops and one conditional branch from an if statement. Valgrind (Cachegrind with branch simulation enabled) reports the following branch misprediction stats:
==22040== Branches: 23,608,086 (23,606,080 cond + 2,006 ind)
==22040== Mispredicts: 2,307,291 ( 2,306,609 cond + 682 ind)
==22040== Mispred rate: 9.8% ( 9.8% + 34.0% )
As an experiment, I replaced the conditional jump at the bottom of every loop except the outermost one with a cmovz instruction followed by an indirect jmp through a register. The branch from the if statement was left as a conditional jump.
For example:
add rcx,1
cmp rcx,[rbp-72]
jl Range_17_Loop
is replaced with:
add rcx,1
cmp rcx,[rbp-72]
cmovz r10,r11
jmp r10
in the following loop:
xor r8,r8
push r10
push r11
mov r10,[jump_17_01] ; move label1 address to r10
mov r11,[jump_17_02] ; move label2 address to r11
Range_17_Loop:
bt rcx,0
jc return_label_1
mov rax,rcx
mul rcx
mov rdx,rax
mov [r9+r8],rdx
add r8,8
return_label_1:
add rcx,1
cmp rcx,[rbp-72]
cmovz r10,r11 ; conditional branch replaced with cmovz
jmp r10
;jl Range_17_Loop
Range_17_Loop_Exit:
pop r11
pop r10
This works because we can get the address of the two labels in NASM before the outer enclosing loop begins:
mov rax,Range_17_Loop
mov [jump_17_01],rax
mov rax,Range_17_Loop_Exit
mov [jump_17_02],rax
so we can jump to either label with the address in a register.
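For anyone who wants to reproduce the pattern in isolation, here is a minimal standalone sketch, assuming Linux x86-64 and a static, non-PIE link. The names demo_loop, demo_exit, limit, jump_01 and jump_02 are hypothetical stand-ins and are not taken from the full listing; limit plays the role of [rbp-72].

```nasm
; Minimal sketch of the cmovz + indirect jmp loop exit (hypothetical names).
; Assumed build: nasm -f elf64 demo.asm && ld -o demo demo.o
section .data
limit:      dq 10               ; hypothetical loop bound, stands in for [rbp-72]

section .bss
jump_01:    resq 1              ; will hold the address of demo_loop
jump_02:    resq 1              ; will hold the address of demo_exit

section .text
global _start
_start:
    ; Store both label addresses once, before the loop starts.
    mov rax, demo_loop
    mov [jump_01], rax
    mov rax, demo_exit
    mov [jump_02], rax

    xor rcx, rcx                ; loop counter
    xor rdi, rdi                ; iteration count, used as the exit status
    mov r10, [jump_01]          ; r10 = address of the loop top
    mov r11, [jump_02]          ; r11 = address of the loop exit

demo_loop:
    add rdi, 1                  ; loop body: count one iteration
    add rcx, 1
    cmp rcx, [limit]
    cmovz r10, r11              ; when rcx == limit, r10 becomes the exit address
    jmp r10                     ; indirect jump: back to demo_loop or on to demo_exit

demo_exit:
    mov eax, 60                 ; Linux sys_exit
    syscall                     ; exit status is the iteration count in rdi
```

The jmp r10 at the bottom is the only branch in this loop, and it is an indirect one, so this small program should be enough to compare against the same loop written with a conditional jump.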
Because cmov instructions bypass branch prediction (see "Why is a conditional move not vulnerable for Branch Prediction Failure?"), I thought this would reduce the number of branch mispredicts. But Valgrind shows a much higher misprediction rate after the change to cmovz:
==22180== Branches: 23,608,122 ( 9,846,969 cond + 13,761,153 ind)
==22180== Mispredicts: 6,267,691 ( 336,434 cond + 5,931,257 ind)
==22180== Mispred rate: 26.5% ( 3.4% + 43.1% )
So the overall misprediction rate has gone from 9.8% to 26.5%, and execution is about 8% slower than before.
The total number of branches executed barely changes (23,608,086 vs 23,608,122), but the indirect count rises from 2,006 to 13,761,153, so I suspect each jmp r10 is counted as an indirect branch. Why is the mispredict rate so much higher if cmov bypasses the branch prediction unit?
The entire assembly listing is almost 600 lines long, so I haven't posted it here. The block above shows the issue, but I will post the whole listing on request.
Thanks.