6

For example, Let's say a variable x, x could be anything include 0.
Then we got code like:

if(x==0) {
    y = 1;
}
else {
    y = x;
}

Could I do this without producing branches in C/C++?

I'm trying to optimize a piece of code. I want to remove branches as much as possible. There are similar judgments, so I want to convert them into statements without branches to make the code as efficient as possible.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
JinZhu
  • 63
  • 3
  • This kind of optimisation is usually attempted in often executed code parts, e.g. within nested loops. You might achieve enough, if you do the checking as far outside of nesting as possible. I.e. work on the `x` long before you get to the shown code. – Yunnosch Oct 27 '22 at 05:52
  • Otherwise please consider rephrasing and elaborating your problem, considerung whether this might be a https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem Otherwise the answer is probably basically "Not.". – Yunnosch Oct 27 '22 at 05:54
  • 1
    If you were using C++ then I’d suggest using a refinement-type to represent state-invariants like your `x != 0` guard. I dunno about C though, sorry. – Dai Oct 27 '22 at 05:55
  • 1
    Agree with the comments above. Having said that, you can try the following "trick": `y = !x + x;`. It will be 1 if x==0 and x otherwise. – wohlstad Oct 27 '22 at 06:01
  • 3
    A smart compiler is very likely to compile your original code with no branches, so no fancy tweaking is needed. For instance, gcc 12.2: [look ma, no jumps](https://godbolt.org/z/KffKqYvx7). – Nate Eldredge Oct 27 '22 at 06:03
  • @wohlstad Very interesting. Do you know about any assumptions (e.g. on compiler behaviour) in that? That would make a good answer. – Yunnosch Oct 27 '22 at 06:16
  • 1
    @Yunnosch added as an answer. – wohlstad Oct 27 '22 at 06:23
  • Performance optimization while looking at your code through a drinking straw like this is not likely to be successful. – Wyck Oct 30 '22 at 04:18

6 Answers6

9

Some general notes:

  1. As mentioned in other comments, some compilers can optimize and eliminate the branch. You can check the assembly output (e.g. in Godbolt) to make sure.
  2. Beware of premature optimizations.
  3. Always measure and make sure your speculation about what's taking up time is correct.

Having said that you can try the following "trick":

y = !x + x;

Assuming x,y are integer types:

  • If x==0, !x will be 1 and y will be assigned the value 1.
  • If x!=0, !x will be 0 and y will be assigned the value x.

Note: see @CostantinoGrana's comment below about the guarantee in the standard. You can also verify it in your specific environment (compiler etc.).

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
wohlstad
  • 12,661
  • 10
  • 26
  • 39
  • Nice. I wonder whether any compiler might represent !0 differently than 1. I could imagine 0xffff (16bit example). If you could dig up a standard quote which does or even does not guarantee... – Yunnosch Oct 27 '22 at 06:31
  • 4
    @Yunnosch [6.5.3.3 Unary arithmetic operators](https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf): 5 The result of the logical negation operator `!` is 0 if the value of its operand compares unequal to 0, 1 if the value of its operand compares equal to 0. The result has type `int`. The expression `!E` is equivalent to `(0==E)`. – Costantino Grana Oct 27 '22 at 06:42
  • @CostantinoGrana That seems to be the perfect input. – Yunnosch Oct 27 '22 at 06:45
  • 2
    Thanks! It worked, but finally I found the performence get worse after this operation. – JinZhu Oct 27 '22 at 09:17
  • I decide still use the branch way – JinZhu Oct 27 '22 at 09:18
  • 2
    @JinZhu If one branch is more likely (e.g. x==0 is rare) then it's common to find no gain or even losses in performance. – Persixty Oct 27 '22 at 09:45
  • How about`y = (!x)|x;`? – user16217248 Oct 28 '22 at 20:33
  • @user16217248 it can work too. For unsigned for sure. Not sure about signed bitwise operations on all compilers. – wohlstad Oct 29 '22 at 03:54
5

You should check the assembler output of your compiler. For example,the X86 architecture has an instruction called cmov (https://www.felixcloutier.com/x86/cmovcc) that is designed to handle this kind of thing without branches. The Arm architecture allows pretty much all instructions to be executed depending on CPU flags: https://developer.arm.com/documentation/dui0473/m/condition-codes/conditional-execution-in-arm-state.

Thus, when optimizations are enabled your compiler will likely produce no branches.

Sami Sallinen
  • 3,203
  • 12
  • 16
4

If you need this non-zero guarantee for an input to __builtin_clz (which gives an undefined result for an input of zero, otherwise counts leading-zero bits), a useful trick is y = x|1. That doesn't affect the number of leading zeros for any non-zero input, because you'd setting the last position CLZ will look at, only if all the higher bits are zero. If x=1, then x|1 is still 1.

Similarly as setup for __builtin_ctz (count trailing zeros), y = x | (1UL<<31) will set the MSB. Or more portably, x | ( 1ULL << (sizeof(x)*CHAR_BIT-1) ).

Unconditionally setting one of the bits will change some non-zero values; this is only useful in certain cases.


To get exactly the semantics you asked for, y = x ? x : 1 can be expressed that way, or as x + !x as suggested in another answer, which asks the compiler to materialize a boolean integer and add it.

Of course, a smart compiler will hopefully pick a way that's efficient on the target machine. For x86, one might hope that a smart compiler would do something like this if it's ok to overwrite x:

    cmp  eax, 1          ; CF=1 only if EAX==0 because only 0 is < 1U
    adc  eax, 0          ; x += (eax < 1U)

While for AArch64, tst or cmp and then cinc can conditionally copy-and-increment a value depending on any flags conditions.

On the Godbolt compiler explorer -

int make_nonzero_booleanize(int x){
    return x + !x;
}

int make_nonzero_ternary(int x){
    return x ? x : 1;
}

int make_nonzero_if(int x){
  // the original.  Compilers may compile it to branchless code
  // especially gcc -O3 instead of -O2 is more aggressive about if-conversion
    int y;
    if (x)
        y = x;
    else
        y = 1;
    return y;
}
# GCC12 -O3 for x86-64
make_nonzero_booleanize(int):           # return x + !x
        cmp     edi, 1
        mov     eax, edi
        adc     eax, 0       # Apparently a GCC dev already thought of this peephole and taught GCC10 about it
           # I was expecting compilers to naively do xor-zeroing + test/setcc + add
        ret

make_nonzero_ternary(int):
        mov     eax, edi
        test    edi, edi
        mov     edx, 1
        cmove   eax, edx     # standard ternary using conditional-move
        ret

make_nonzero_if(int):
         # same asm as ternary; GCC did an if-conversion optimization
// ARM64 gcc12 -O2
make_nonzero_booleanize(int):
        cmp     w0, 0
        cinc    w0, w0, eq       // return x==0 ? x : x+1
        ret
make_nonzero_ternary(int):
        cmp     w0, 0
        csinc   w0, w0, wzr, ne  // return x==0 ? x : 0+1
        ret
make_nonzero_if(int):
     // identical to ternary, if-conversion even with -O2

Both ways are equally good; the csinc result still has a data dependency on the input w0 even if the condition is true and the result comes from incrementing the zero register (wzr).

# RISC-V 32-bit clang 15 -O3
make_nonzero_booleanize(int):
        seqz    a1, a0                # tmp = (x==0)
        add     a0, a0, a1            # return x + tmp
        ret

make_nonzero_ternary(int):
        mv      a1, a0                  # tmp = x
        li      a0, 1                   # retval = 1
        beqz    a1, .LBB1_2             # if (x != 0) {
        mv      a0, a1                    # retval = tmp
.LBB1_2:                                # }
        ret                             # return retval

make_nonzero_if(int):
      # identical to ternary, same branching.

Clang didn't even do a good job with the branching strategy it picked here; it could have done a bnez over a retval=1, avoiding both mv instructions. Unless there's some tuning thing where a branch over an mv is handled by some RISC-V microarchitectures as a conditional-move? Hopefully after inlining it wouldn't be this bad, and might put a branch target for the ==0 special case somewhere else, where it didn't have to jump over it.

So x + !x is better with some real-world compilers on x86-64 and RISC-V, and equal on AArch64.

If you care about this in some specific context with a specific compiler for some ISA, look at how it compiles in your code. (If you know anything about what's efficient in asm or not.)


Older GCC versions:

GCC7 and earlier does better with the ternary, avoiding the mov eax, edi, instead doing mov eax, 1 and then using cmovnz to copy EDI or not. Hopefully the inefficiency in GCC8 and later only happens when not inlining this tiny function; GCC is known to waste instructions when it has to work around hard register constraints from calling conventions or inline asm.

GCC9 and earlier doesn't know the cmp/adc trick for x86-64, instead using a literal implementation of x + !x, which is unfortunately not great because x86 sucks at materializing a boolean condition as a 0/1 integer due to the poor design of setcc.

# GCC9 and earlier  -O3
make_nonzero_booleanize(int):
        xor     eax, eax       # zero the full reg
        test    edi, edi       # set FLAGS according to x
        sete    al             # EAX = !x
        add     eax, edi       # EAX += x
        ret

To be fair, adc eax, 0 costs 2 uops on Intel before Sandybridge. (And GCC tuning choices maybe didn't realize that adc with immediate 0 was special cased; in general adc costs 2 uops on Intel before Broadwell.)


Branchless might be slower if x == 0 is very rare

Even in the best case, the branchless code for x86-64, AArch64, and RISC-V is lengthening the critical path dependency chain by 2 instructions, both with 1 cycle latency on most CPUs. (For example, csinc can't run until the flags result of cmp is ready.)

But a correctly predicted branch is a control dependency, so it's handled separately by the CPU, optimistically assuming that y=x will be the case, only bailing out if it later detects a mispredict.

But writing an if() in the C source doesn't guarantee you'll get branchy asm. (That or an || makes it more likely to compile branchy than other ways of writing it, though.)

Related:

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
1

Despite the fact that the question has been answered, you can do something like this: x==0?!x: x, However, the condition is still present. Although it's not the best, it will at least guarantee that x won't be zero in any case.

M.Abdullah Iqbal
  • 219
  • 1
  • 17
-2

Maybe you can do this just like you are doing it in perl: (void)((y = x) || (y = x + 1));

Yan Huang
  • 1
  • 1
  • 1
    Maybe you can turn this into a more assertive answer. Try for [answer]. You could try and base your answer on a test result. Note that "Try whether helps." is more a comment than an answer. Compare https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead – Yunnosch Oct 27 '22 at 06:41
  • `||` is specified as short-circuiting, thus branching on `y=x` being 0. In terms of hinting the compiler towards branchless code, it's not better. In practice it compiles the same as the `if` in the question. https://godbolt.org/z/jvjv4Er63 It's also not very readable or idiomatic for C. – Peter Cordes Oct 27 '22 at 20:17
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Oct 31 '22 at 10:05
-2

well you could always do (x * x) + 1 a number squared can never be negative (unless imaginary) and if both equal zero the result would be 1

  • 1
    `x*x` can overflow, which is undefined behaviour for signed integers in C. Are you assuming `x` is a floating-point number, so overflow would wrap to +Infinity? Or that it's unsigned, so `x*x` could be zero, but will always have its low bit clear? Anyway, if you don't care about keeping `x` the same for non-zero `x`, it's much more efficient to do `x|1` or something. – Peter Cordes Oct 27 '22 at 16:20