I'm assuming there's some actual code after the `jz` before that `ret`; in your example the fall-through and taken paths both lead to `ret`. (Which might as well be the same `ret`, no need to replicate it.)
You're spending a significant number of extra instructions to branchlessly evaluate a single boolean. Make sure you benchmark it against a simple version that uses 2 branches, on realistic patterns of data as part of your full program.
Modern TAGE branch predictors use previous branch history (along the path of execution) to index a prediction for the current branch. You may find the 2-branch way still predicts well, and doesn't unduly hurt the prediction rate for other branches by polluting more entries.
Microbenchmarking branchy vs. branchless is hard because modern predictors are so sophisticated that the surrounding code can make a big difference. Isolating the code under test in a repeat loop that runs only that code can have a huge effect on branch-prediction success.
But yes, your idea is worth considering.
You probably don't want to write CH. On Haswell/Skylake CPUs that costs a merging uop which issues in a cycle by itself (stalling the front-end) when RCX is later read. (See How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent.)
Instead consider setting CL and DL and using `or cl, dl` / `jz` to jump if they're both zero. Also you might want to xor-zero them to avoid a false dependency. `or`/`jz` can't macro-fuse into a single test-and-branch uop the way `and` or `test` can, but it's still better (on Intel CPUs) than a CH merge. Your way might be better on Ryzen, where `setnz cl` will just merge into the existing RCX value.
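A minimal sketch of that version (the condition codes and label are placeholders for whatever your logic actually needs):

xor   ecx, ecx        ; xor-zero both destinations to avoid false dependencies
xor   edx, edx
...                   ; first cmp/test
setb  cl              ; first boolean
...                   ; second cmp/test
setnz dl              ; second boolean
or    cl, dl          ; ZF set only if both booleans are 0
jz    neither_condition   ; made-up label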
Partial-flag merging is usually more efficient than partial-reg merging on modern Intel CPUs, so maybe use `shrx`/`test` to set ZF, then `bt ecx, 0` to put an earlier `setcc` result back into CF without disturbing ZF. (It seems to happen without even a flag-merging uop: in What is a Partial Flag Stall?, BeeOnRope reports no evidence of flag-merging uops on Skylake.)

If that lets you check both conditions with one branch like `ja` or `jbe` that depends on both CF and ZF, it might be more efficient to avoid materializing one of the booleans in an integer register.
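Something along these lines, maybe (register roles and the final condition are guesses at your logic; the label is made up):

cmp  r8, 52           ; CF = (r8 < 52)
setb cl               ; save that boolean before CF gets clobbered
shrx r11, r10, r8
test r11, rax         ; ZF = (no masked mantissa bits set); also clears CF
bt   ecx, 0           ; CF = low bit of CL, leaves ZF untouched
jbe  either_or_both   ; taken if CF=1 or ZF=1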
If you need to invert one or both of the booleans to get it to work:
- you can use `setb` instead of `setnb`.
- you can maybe use `andn` instead of `test` to invert RAX when testing against the same shifted mask (example below). (Err, I think that only works if you had a single-bit mask.)
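For instance, assuming a hypothetical single-bit mask in r10, the two instructions give opposite ZF results:

test rax, r10         ; ZF = 1 if that mask bit of RAX is clear
andn r11, rax, r10    ; r11 = ~RAX & r10; ZF = 1 if that mask bit of RAX is set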
To avoid partial-register / false-dependency shenanigans you might consider using `cmovcc` instead of `setcc`; it's single-uop on Intel Broadwell and later, and on AMD. The only mainstream CPU with BMI2 but 2-uop CMOV is Haswell, and that's not a disaster. IDK if that helps any; you probably still need to zero two registers, so you might as well do that for the `setcc` destinations to avoid false deps.

I think this does help some: we get to use `test` instead of `or`, so it can macro-fuse into a single uop with `jnz`.
xor edx, edx ; can hoist this, or use any other register that's known zero in the low 8.
xor ecx, ecx ; just to avoid false deps. Optional if RCX is cold or part of the input dep chain leading to setnb, on Haswell and later or on AMD.
...
setb cl ; instead of setnb
...
and r11, rax
cmovz ecx, edx ; if ZF is set (r11 == 0), zero CL
test cl, cl
jz below_and_zero_R11
(I probably have one of the conditions flipped, but you can invert the conditions on the setcc, cmovcc, and jcc without affecting performance to get the logic you actually need).
Possibly this can do even better and `cmp`/`cmov` a non-zero value over `r11d` itself, avoiding `setcc`. (Defer the `cmp` until after producing `r11`.)
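Maybe something like this (a sketch only, reusing the register roles from the block further down; the branch sense needs checking as usual):

shrx  r11, r10, r8    ; shifted mask
and   r11, rax        ; R11 = masked mantissa bits
cmp   r8, 52          ; CF = (r8 < 52), deferred until after producing R11
cmovb r11, r10        ; R10 (the mask) happens to be a known-non-zero register
test  r11, r11
jz    @C_2            ; R11 stays zero only if neither condition was true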
After `shr reg, 53`, the upper 32 bits are guaranteed to be zero, so you can save code size (REX prefixes) by using 32-bit operand-size for the later instructions. Or you could, if you were using one of the low 8 registers instead of r8..r15: e.g. `shr rdi, 53` / `sub edi, 1023`. Using `r8d` won't save code size because it still needs a REX prefix to encode R8.
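For example (instruction lengths from the standard encodings):

shr rdi, 53       ; 4 bytes (REX.W: the source value is 64-bit)
sub edi, 1023     ; 6 bytes: 32-bit operand-size, no REX needed for EDI
; versus
sub rdi, 1023     ; 7 bytes (REX.W)
sub r8d, 1023     ; 7 bytes: still needs a REX prefix just to encode R8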
Defer the `cmp` until last so you can use `adc` instead of `setcc` to read CF.
`setnb` tests that CF=0. We can instead use `adc` or `sbb` to modify a `setz` or `setnz` result. `adc reg, 0` is a single-uop instruction on every CPU that supports BMI2 (as long as you avoid the `adc al, imm8` special-case encoding); see Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

(Update: apparently `adc cl, 0` is still 2 uops on Haswell, so use `adc ecx, 0` instead. With xor-zeroing of ECX ahead of this, that's still safe for P6-family, not causing a partial-register stall. You need the whole ECX to be zeroed ahead of the `setcc` anyway if you depend on the upper bits being zero.)
mov r10, 0x1ffffffffffffe ; mask
mov r8, rax
shr r8, 53
sub r8d, 1023
shrx r11, r10, r8
xor ecx, ecx ; avoid false dep
and r11, rax
setnz cl ; cl = (r11 != 0)
cmp r8, 52 ; r8 < 52 (unsigned)
adc ecx, 0 ; add CF (from the cmp) into cl
; cl = (r11!=0) + (r8<52)
; test cl, cl ; ADC sets flags
jz @C_2 ; or JNZ, I didn't check the logic
...
@C_2: ; integer
ret
`adc ecx, 0` can only make ECX non-zero. You can't have CF=1 result in cl=0 without a dependency on the old `cl`.
But another option to combine conditions is `sbb ecx, 0` and then check CF: CF will only be set if ECX was zero and became -1, i.e. old_ecx = 0 and input_CF = 1.
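A sketch of that version (same caveat as above about which condition sense you actually need):

xor   ecx, ecx
and   r11, rax
setnz cl              ; cl = (r11 != 0)
cmp   r8, 52          ; CF = (r8 < 52)
sbb   ecx, 0          ; borrows (sets CF) only if ECX was 0 and input CF was 1
jb    below_and_zero_R11   ; taken if r11 == 0 && r8 < 52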
Maybe just use the FPU:
If you have BMI2, you almost certainly have SSE4.1. (And probably AVX).
If throughput is more important than latency, consider using `roundsd` (or `roundpd` to check 2 at once):
roundpd xmm1, xmm0, 8 ; imm bits 1:0 = 00 for round-to-nearest-even; bit 3 (8) suppresses the Precision (inexact) exception
pcmpeqq xmm1, xmm0 ; compare the FP bit patterns
movmskpd ecx, xmm1 ; extract the sign bits
; ecx=0b11 if rounding to integer didn't change the bit-pattern
`roundpd` / `roundsd` is 2 uops. (https://agner.org/optimize)
Also, if you have a lot to check in a row without any other FP ops, then maybe consider just looking at MXCSR to see if a conversion set the "inexact" flag. That involves storing MXCSR to memory with `stmxcsr m32` and reloading, but store-forwarding makes that efficient. e.g. do a group of 8 and then check that sticky MXCSR flag to see if any of them were non-integer, then go back and see which one of the group it was.
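Roughly like this (MXCSR bit 5 is the sticky Precision/Inexact flag; this assumes a red zone or spare stack slot is available for the store, and the label is made up):

stmxcsr [rsp-4]                    ; red zone (SysV); Windows would need a real stack slot
and     dword [rsp-4], 0xFFFFFFDF  ; clear PE (bit 5), in case it was already set
ldmxcsr [rsp-4]
; ... a group of, say, 8 cvtsd2si (or similar) conversions ...
stmxcsr [rsp-4]
test    byte [rsp-4], 0x20         ; PE = 1 if any of them was inexact (non-integer)
jnz     group_has_non_integer      ; then re-check individually to find which one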
(If you actually want the conversion result then you could be using `cvtsd2si rax, xmm0` instead of `roundsd`.)
Clearing the Inexact flag before an operation would certainly add to the cost, though. But `ldmxcsr` is not too expensive. IIRC, modern CPUs rename MXCSR so it doesn't serialize FP operations.