
I have the following ASM code:

        mov                      r10  , 9007199254740990        ; mask
        mov                      r8   , rax
        shr                      r8   , 53
        sub                      r8   , 1023
        cmp                      r8   , 52                      ; r8 - 52 < 0
        setnb                    ch
        shrx                     r11  , r10  , r8
        and                      r11  , rax
        setne                    cl                             ; r11 == 0

        test                     rcx  , rcx
        jz      @C_2

        ret
@C_2:   ; integer
        ret

Well, here we have only one branch instruction. We could rewrite this code by replacing the SETcc instructions with the corresponding Jcc (jump) instructions, which would give us two branch instructions instead. My question is: which code will run faster in the common (random data) case, and why?

  • Both will be similarly slow on random data; the SETcc version is probably slightly better, but you are killing the performance with other mistakes, so this code is generally far from optimal. Maybe rewrite it in C++ and check the output of a modern compiler with optimizations enabled; it should produce better code than this. (I mean: after you remove the other performance penalties, the SETcc variant might start to show a slight edge over the Jcc variant, but as it stands the difference will very likely be hidden in the other penalties this code induces.) – Ped7g Sep 17 '19 at 13:47
  • Why exactly is that code not optimal? I did check compiler output, and that's why I'm asking about optimality... – Ivan Kamynin Sep 17 '19 at 14:27
  • A C or C++ compiler will always favor SETcc over branches, when it can. https://stackoverflow.com/a/11227902/17034 – Hans Passant Sep 17 '19 at 17:12
  • @IvanKamynin it's using partial-register writes and then reads the whole `rcx`; and actually I'm not sure why you don't short-circuit the test at the first `setnb` and jump to `@C_2` right away (maybe you need the remaining part of the calculation?). The `shr r8, 53` may eventually be avoidable in certain cases, although I may have misunderstood what this code is trying to calculate (I didn't look too closely after seeing the `ch` + `cl` + `rcx` part without even a `xor ecx,ecx` ahead of it; that alone should induce quite a penalty). – Ped7g Sep 17 '19 at 17:33
  • But as always with performance, these theoretical chats are nice, yet without profiling data it's just chit-chat. And profiling this kind of micro-optimization is very tricky; I currently don't have a good test harness on my local box to just plug this into and toy around with :/ (Plus, before trying to write a better variant, I would definitely want the math spec first, i.e. what it is actually computing; maybe there is a simpler calculation that gets the desired result.) – Ped7g Sep 17 '19 at 17:35
  • Well, this code checks that a double is a plain integer in [0; 2^53). In my code above that is true only if the whole RCX register is zero after the two comparisons. Knowing that, does your point about the bottlenecks still stand? – Ivan Kamynin Sep 17 '19 at 17:58

1 Answer


I'm assuming there's some actual code after the jz before that ret; in your example the fall-through and taken paths both lead to ret. (Which might as well be the same ret, no need to replicate.)


You're spending a significant number of extra instructions to branchlessly evaluate a single boolean. Make sure you benchmark it against a simple version that uses 2 branches, on realistic patterns of data as part of your full program.
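
For reference, a minimal sketch of the 2-branch variant (same registers and mask as in the question; `@not_integer` is a made-up label, and the integer polarity is copied from the original `test rcx,rcx / jz @C_2`):

        mov     r10, 9007199254740990     ; mask
        mov     r8, rax
        shr     r8, 53
        sub     r8, 1023
        cmp     r8, 52
        jnb     @not_integer              ; branch 1: r8 >= 52 (unsigned)
        shrx    r11, r10, r8
        and     r11, rax
        jnz     @not_integer              ; branch 2: non-zero bits below the binary point
        ; fall through: the @C_2 (integer) case
        ...
@not_integer:
        ...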

Modern TAGE branch predictors use previous branch history (along the path of execution) to index a prediction for the current branch. You may find the 2-branch way still predicts well, and doesn't unduly hurt the prediction rate for other branches by polluting more entries.

Microbenchmarking branchy vs. branchless is hard because modern predictors are so sophisticated that what code feeds them can make a big difference. Isolating the code under test in a repeat loop that runs only that code can have a huge effect on branch prediction success.

But yes, your idea is worth considering.


You probably don't want to write CH. On Haswell/Skylake CPUs that will stall the front-end when RCX is later read: a merging uop has to issue in a cycle by itself. (How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent)

Instead consider setting CL and DL and using or cl, dl / jz to jump if they're both zero. Also you might want to xor-zero them to avoid a false dependency. or/jz can't macro-fuse into a single test-and-branch uop like and or test can, but it's still better (on Intel CPUs) than a CH merge. Your way might be better on Ryzen where setnz cl will just merge into the existing RCX value.
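
A minimal sketch of that variant (register choices are arbitrary, and the polarity is copied from the original `setnb`/`setne`/`jz @C_2` sequence):

    xor    ecx, ecx        ; xor-zero to avoid false deps / partial-reg issues
    xor    edx, edx
    ...
    cmp    r8, 52
    setnb  cl              ; cl = (r8 >= 52)
    shrx   r11, r10, r8
    and    r11, rax
    setnz  dl              ; dl = (r11 != 0)
    or     cl, dl          ; ZF=1 only if both booleans are zero (no macro-fusion with jz)
    jz     @C_2            ; integer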


Partial-flag merging is usually more efficient than partial-reg merging on modern Intel CPUs, so maybe shrx/test to set ZF, then use bt ecx, 0 to put a setcc result back into CF without disturbing ZF. (It seems to happen without even a flag-merging uop: What is a Partial Flag Stall? - BeeOnRope reports no evidence of flag-merging uops on Skylake.)

If that lets you check both conditions with one branch like ja or jbe that depends on both CF and ZF, it might be more efficient to avoid materializing one of the booleans in an integer register.
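
For illustration, a sketch of the flag-preservation part only; just one boolean is materialized, the other stays in ZF (whether a single ja/jbe then matches the condition you actually need is where the inversions below come in):

    cmp    r8, 52
    setb   cl              ; materialize CF into a register: cl = (r8 < 52)
    shrx   r11, r10, r8
    and    r11, rax        ; ZF = (r11 == 0); this also clobbers CF
    bt     ecx, 0          ; CF = cl again, without disturbing ZF from the AND
    ; jbe here would test CF=1 OR ZF=1, i.e. (r8 < 52) OR (r11 == 0);
    ; ja would test the complement of that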

If you need to invert one or both of the booleans to get it to work:

  • you can use setb instead of setnb.
  • you can maybe use andn instead of test to invert RAX when testing against the same shifted mask. (Err, I think that only works if you had a single-bit mask.)

To avoid partial-register / false dependency shenanigans you might consider using cmovcc instead of setcc; it's single-uop on Intel Broadwell and later, and on AMD. The only mainstream CPU with BMI2 but 2-uop CMOV is Haswell, and that's not a disaster.

IDK if that helps any; you probably still need to zero two registers, so you might as well zero the setcc destinations to avoid false deps.

I think this does help some: we get to use test instead of or so it can macro-fuse into a single uop with jnz.

    xor  edx, edx   ; can hoist this, or use any other register that's known zero in the low 8.

    xor    ecx, ecx        ; just to avoid false deps.  Optional if RCX is cold or part of the input dep chain leading to setnb, on Haswell and later or on AMD.
    ...
    setb   cl              ; instead of setnb
    ...

    and    r11, rax
    cmovz  ecx, edx        ; if ZF is set, make the branch not-taken.

    test   cl, cl
    jz     below_and_zero_R11

(I probably have one of the conditions flipped, but you can invert the conditions on the setcc, cmovcc, and jcc without affecting performance to get the logic you actually need).

Possibly this can do even better and cmp/cmov a non-zero value over r11d itself, avoiding setcc. (Defer the cmp until after producing r11)
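
For example, something along these lines (a sketch only; r10 still holds the mask, so it is a convenient known-non-zero value to cmov in, and the polarity is copied from the original `jz @C_2`):

    shrx   r11, r10, r8
    and    r11, rax        ; r11 = the bits that must be zero for an integer
    cmp    r8, 52          ; CF = (r8 < 52), unsigned
    cmovnb r11, r10        ; if r8 >= 52, force r11 non-zero (r10 = the mask, known non-zero)
    test   r11, r11
    jz     @C_2            ; integer iff r8 < 52 and the masked bits were zero

This avoids setcc and the xor-zeroing entirely, and test/jz can macro-fuse.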


After shr reg, 53, the upper 32 bits are guaranteed to be zero. You can save code size (REX prefixes) by using 32-bit operand-size. Or you could if you were using one of the low 8 registers, not r8..r15. e.g. shr rdi, 53 / sub edi, 1023. Using r8d won't save code-size because it still needs a REX prefix because of r8.
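
For example (only the prefixes differ; `rdi` here is just a stand-in for any low-8 register):

    shr    r8, 53          ; needs a REX.W+B prefix anyway, because of r8
    sub    r8d, 1023       ; still needs REX.B: no size saving
    ; vs.
    shr    rdi, 53         ; REX.W prefix
    sub    edi, 1023       ; no REX prefix at all: one byte smaller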


Defer the cmp until last so you can use adc instead of setcc to read CF.

setnb tests that CF=0. We can instead use adc or sbb to modify a setz or setnz result. adc reg,0 is a single-uop instruction on every CPU that supports BMI2 (as long as you avoid the adc al, imm8 special case encoding). Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?

(Update: apparently adc cl,0 is still 2 uops on Haswell. So use adc ecx,0 instead. With xor-zeroing of ECX ahead of this, it's still safe for P6-family, not causing a partial-register stall. You need the whole ECX to be zeroed ahead of setcc if you depend on the upper bits being zero.)

        mov                      r10, 0x1ffffffffffffe        ; mask

        mov                      r8, rax
        shr                      r8, 53
        sub                      r8d, 1023

        shrx                     r11, r10, r8
        xor                      ecx, ecx                      ; avoid false dep
        and                      r11, rax
        setnz                    cl                            ; r11 == 0

        cmp                      r8, 52                        ; r8 < 52 (unsigned)
        adc                      ecx, 0              ; cl = ZF (from r11) + CF (from cmp).
        ; cl = (r11!=0) + (r8<52)

        ; test                     cl, cl           ; ADC sets flags
        jz      @C_2                             ; or JNZ, I didn't check the logic

        ...

@C_2:   ; integer
        ret

adc ecx,0 can only make ECX non-zero. You can't have CF=1 result in cl=0 without a dependency on the old cl.

But another option to combine conditions is sbb ecx, 0 and then check CF: CF will only be set if ECX was zero and became -1. i.e. old_ecx = 0 and input_CF = 1.
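
A sketch of that combination, again with the polarity copied from the original code (integer iff r11 == 0 and r8 < 52):

    xor    ecx, ecx        ; ECX must be fully zero for this to work
    ...
    and    r11, rax
    setnz  cl              ; cl = (r11 != 0)
    cmp    r8, 52          ; CF = (r8 < 52)
    sbb    ecx, 0          ; leaves CF=1 (borrow) only if ECX was 0 and CF was 1
    jc     @C_2            ; integer: r11 == 0 and r8 < 52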


Maybe just use the FPU:

If you have BMI2, you almost certainly have SSE4.1. (And probably AVX).

If throughput is more important than latency, consider using roundsd (or roundpd to check 2 at once):

    roundpd   xmm1, xmm0,  8               ; imm8 = 8: round to nearest (even) and suppress the Inexact exception; 0 also rounds to nearest but signals Inexact
    pcmpeqq   xmm1, xmm0                   ; compare the FP bit patterns
    movmskpd  ecx, xmm1                    ; extract the sign bits
    ; ecx=0b11  if rounding to integer didn't change the bit-pattern

roundpd / roundsd is 2 uops. (https://agner.org/optimize).

Also, if you have a lot to check in a row without any other FP ops, then maybe consider just looking at MXCSR to see if a conversion set the "inexact" flag. That involves storing MXCSR to memory with stmxcsr m32 and reloading, but store-forwarding makes that efficient. e.g. do a group of 8 and then check that sticky MXCSR flag to see if any of them were non-integer, then go back and see which one of the group it was.

(If you actually want the conversion result then you could be using cvtsd2si rax, xmm0 instead of roundsd)

Clearing the Inexact flag before an operation would certainly add to the cost, though. But ldmxcsr is not too expensive. IIRC, modern CPUs rename MXCSR so it doesn't serialize FP operations.
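
A rough sketch of that approach; the sticky Inexact (PE) flag is bit 5 of MXCSR, and `@some_not_integer` plus the stack scratch slot are just for illustration:

    sub      rsp, 8
    stmxcsr  [rsp]
    and      dword [rsp], 0xFFFFFFDF   ; clear bit 5, the sticky Inexact (PE) flag
    ldmxcsr  [rsp]

    cvtsd2si rax, xmm0                 ; sets PE if the input wasn't exactly an integer
    ; ... the rest of the group of conversions ...

    stmxcsr  [rsp]
    test     dword [rsp], 0x20         ; did anything in the group round?
    lea      rsp, [rsp+8]              ; restore RSP without clobbering flags
    jnz      @some_not_integer         ; then go back and check each one individually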

Peter Cordes
  • w0w m3g4 4n5w3r! // In my case ECX is always zero, because it is set earlier – Ivan Kamynin Sep 17 '19 at 18:22
  • As I can see on uops.info, we should use adc r64/32, 0 to get 1-cycle latency. – Ivan Kamynin Sep 19 '19 at 11:46
  • @IvanKamynin: hmm, it seems you're right; according to uops.info `adc bl, 0` isn't special-cased the way `adc ebx, 0` is. https://www.uops.info/html-tp/HSW/ADC_NOREX_R8_I8-Measurements.html. I was misremembering that. On Haswell and later writing a low-8 reg already does an RMW of the whole reg, so it doesn't cost an extra uop to read ECX after writing CL. But you do have to make sure that ECX is zeroed so ZF is set correctly. – Peter Cordes Sep 19 '19 at 17:20