How can I make branchless code?

Question

Related to this answer: https://stackoverflow.com/a/11227902/4714970

In the above answer, it's mentioned how you can avoid branch prediction fails by avoiding branches.

The user demonstrates this by replacing:

if (data[c] >= 128)
{
    sum += data[c];
}

With:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

How are these two equivalent (for the specific data set, not strictly equivalent)?

What are some general ways I can do similar things in similar situations? Would it always be by using >> and ~?

http://www.hackersdelight.org/ is a nice collection of short functions, often exploiting clever bit-hacks. I think there's another collection that focuses more on bit-hacks like you're talking about, but I can't think of it atm. — Peter Cordes, Aug 19 '15 at 23:32
Some compilers can replace the conditional operator `?:` with a branchless `cmov` instruction. — CodesInChaos, Aug 20 '15 at 08:58
Fortunately, it's archived: https://web.archive.org/web/20190915025154/http://www.hackersdelight.org/ — kaartic, Aug 21 '20 at 17:33

score 41 · Accepted Answer · edited Aug 20 '15 at 08:28

int t = (data[c] - 128) >> 31;

The trick here is that if data[c] >= 128, then data[c] - 128 is nonnegative, otherwise it is negative. The highest bit in an int, the sign bit, is 1 if and only if that number is negative. >> is a shift that extends the sign bit, so shifting right by 31 makes the whole result 0 if it used to be nonnegative, and all 1 bits (which represents -1) if it used to be negative. So t is 0 if data[c] >= 128, and -1 otherwise. ~t switches these possibilities, so ~t is -1 if data[c] >= 128, and 0 otherwise.

x & (-1) is always equal to x, and x & 0 is always equal to 0. So sum += ~t & data[c] increases sum by 0 if data[c] < 128, and by data[c] otherwise.

Many of these tricks can be applied elsewhere. This trick can certainly be generally applied to have a number be 0 if and only if one value is greater than or equal to another value, and -1 otherwise, and you can mess with it some more to get <=, <, and so on. Bit twiddling like this is a common approach to making mathematical operations branch-free, though it's certainly not always going to be built out of the same operations; ^ (xor) and | (or) also come into play sometimes.

*if data[c] >= 128, then ...* - **only if the subtraction doesn't have signed wraparound.** e.g. `INT_MIN - 128` wraps to become a positive integer. This is why signed conditions in asm check the Overflow and Sign flags, not just Sign. e.g. on x86, [`cmovge`](https://www.felixcloutier.com/x86/cmovcc) (greater or equal) checks `SF == OF`. Only looking at the sign bit is safe if `data[c]` is a narrower type that gets sign-extended to `int`, because that makes signed overflow impossible. Or in this case if you know that `data[c] >= INT_MIN+128` will always be true. — Peter Cordes, Dec 02 '20 at 20:08

apangin · Answer 2 · 2015-08-20T00:52:08.370

15

While Louis Wasserman's answer is correct, I want to show you a more general (and much clearer) way to write branchless code. You can just use ? : operator:

    int t = data[c];
    sum += (t >= 128 ? t : 0);

JIT compiler sees from the execution profile that the condition is poorly predicted here. In such cases the compiler is smart enough to replace a conditional branch with a conditional move instruction:

    mov    0x10(%r14,%rbp,4),%r9d  ; load R9d from array
    cmp    $0x80,%r9d              ; compare with 128
    cmovl  %r8d,%r9d               ; if less, move R8d (which is 0) to R9d

You can verify yourself that this version works equally fast for both sorted and unsorted array.

edited Aug 20 '15 at 00:52

answered Aug 19 '15 at 23:52

apangin

92,924
10
193
247

4

Note that when branches *are* predictable, turning control dependencies into data dependencies on the critical path can be a bad thing. `cmov` puts both its inputs into the dependency chain. Also, good compilers can often convert simple `if` clauses into `cmov`. – Peter Cordes Aug 20 '15 at 00:15
Yes doing this improves the speed of both sorted and unsorted, but if it's unsorted it's not as effective as making it branchless. See results -> Unsorted = 11.3 seconds, ?: op = 9.18 seconds, Branchless sorted = 5.89 seconds, Branchless unsorted = 5.86, Sorted = 5.56 seconds and ?:op sorted = 4.77 seconds – Aequitas Aug 20 '15 at 00:22
@Aequitas What CPU and Java version do you run? On my PC the results are equal for both sorted and unsorted. Note parenthesis around the expression - they are necessary here for producing `cmov` (due to int-to-long conversion issues). – apangin Aug 20 '15 at 00:31
Java 1.8_51 i7 4930MX. Similar results with 1.8_60 – Aequitas Aug 20 '15 at 00:35
@Aequitas Seems like 32-bit Java. OK, try `int t = data[c]; sum += (t >= 128 ? t : 0);` – apangin Aug 20 '15 at 00:44
yes 32 bit java. That change makes it slower for sorted arrays to ~5.7 comparable but faster than branchless. But greatly speeds up the unsorted one to ~4.89 comparable to the original ?: op line you had – Aequitas Aug 20 '15 at 00:48
`? :` ternary operator `https://en.wikipedia.org/wiki/%3F:` +1 for this answer as it is better readable! – Preexo Aug 20 '15 at 07:19
3

The problem with this approach is that you're relying on compiler optimization. So if you want to be sure that you can branchless code (e.g. to avoid timing attacks in cryptography code), using the ugly workaround is still a good idea (even if it isn't a 100% guarantee either). – CodesInChaos Aug 20 '15 at 09:00
7

`? :` is a branch, it just has the advantage over `if()` that it's an expression rather than a statement. An inline function with an `if()` and two `return`s would be basically equivalent to a `? :`. If it results in branchless machine language, that's due to optimization. See @aki-suihkonen's answer http://stackoverflow.com/a/32113018/260122 for a longer discussion. – clacke Aug 20 '15 at 13:20
1

@clacke I don't say it is not a branch. Neither I say it is. It is an operator, a language construction. It may or may not be compiled to a branch. In Java bytecode there is a branch (just because there are no other conditional bytecodes except branches). But what I say is that JIT compiler recognizes this particular bytecode pattern and translates it to a conditional move instruction. – apangin Aug 20 '15 at 14:12
@apangin: eww, 32bit java? Try using the `server` VM, if it still isn't the default in 32bit JVMs. It's the default in 64bit JVMs. It optimizes more. – Peter Cordes Aug 20 '15 at 14:23
`? :` is in no way a "more general" way to write branchless code, because it isn't a "way" to write branchless code at all. Semantically `? :` creates a branch, you even say it's visible in the bytecode. The JIT may or may not turn branched code into branchless code. – clacke Aug 22 '15 at 19:21
Conditionals create branches, otherwise how does the conditional act on the condition? So maybe another way to ask the question is: How does one create code without conditionals? – clacke Aug 22 '15 at 19:23
@clacke I completely agree with your point on conditionals. My answer is not about writing code without conditionals. However, when we talk about microarchitecture, a branch is a little bit different concept. For example, in ARM architecture almost every instruction can be conditional. You can think of conditional instructions as functions with an extra argument. So, in the context of *branch prediction* I think my answer still makes sense. – apangin Aug 23 '15 at 00:46
Though if you are not content with the wording, please feel free to edit my answer. – apangin Aug 23 '15 at 00:49
Interesting point about ARM conditional instructions. From a branch prediction standpoint, I guess you are right that they should be considered branchless. So strike my equating branchless with conditionless. I maintain that the syntax choice of `if()` vs `? :` is irrelevant to the question asked. – clacke Aug 24 '15 at 14:53

Aki Suihkonen · Answer 3 · 2015-08-20T11:26:12.710

Branchless code means typically evaluating all possible outcomes of a conditional statement with a weight from the set [0, 1], so that the Sum{ weight_i } = 1. Most of the calculations are essentially discarded. Some optimization can result from the fact, that E_i doesn't have to be correct when the corresponding weight w_i (or mask m_i) is zero.

  result = (w_0 * E_0) + (w_1 * E_1) + ... + (w_n * E_n)    ;; or
  result = (m_0 & E_0) | (m_1 & E_1) | ... | (m_n * E_n)

where m_i stands for a bitmask.

Speed can be achieved also through parallel processing of E_i with a horizontal collapse.

This is contradictory to the semantics of if (a) b; else c; or it's ternary shorthand a ? b : c, where only one expression out of [b, c] is evaluated.

Thus ternary operation is no magic bullet for branchless code. A decent compiler produces branchless code equally for

t = data[n];
if (t >= 128) sum+=t;

vs.

    movl    -4(%rdi,%rdx), %ecx
    leal    (%rax,%rcx), %esi
    addl    $-128, %ecx
    cmovge  %esi, %eax

Variations of branchless code include presenting the problem through other branchless non-linear functions, such as ABS, if present in the target machine.

e.g.

 2 * min(a,b) = a + b - ABS(a - b),
 2 * max(a,b) = a + b + ABS(a - b)

or even:

 ABS(x) = sqrt(x*x)      ;; caveat -- this is "probably" not efficient

In addition to << and ~, it may be equally beneficial to use bool and !bool instead of (possibly undefined) (int >> 31). Likewise, if the condition evaluates as [0, 1], one can generate a working mask with negation:

-[0, 1] = [0, 0xffffffff]  in 2's complement representation

How can I make branchless code?

3 Answers3

Linked

Related