12

I realize I've got a great lack of knowledge in that area (fancy way of saying I don't know jack about it).

Is there some documentation as to how and when to use them?

MPelletier
  • 16,256
  • 15
  • 86
  • 137
  • 4
    Can you provide an example, or a reference? – Greg Hewgill May 26 '11 at 03:19
  • 1
    No :) I was hoping SO could, really. I read about it in this old [question](http://stackoverflow.com/questions/1801431/basic-skills-to-work-as-an-optimiser-in-the-gaming-industry), top answer. – MPelletier May 26 '11 at 03:21
  • Ok. Does this question help? http://stackoverflow.com/questions/4937067/branchless-binary-search – Greg Hewgill May 26 '11 at 03:22
  • 2
    This page has an explanation http://www.blueraja.com/blog/285/branchless-conditionals-compiler-optimization-technique – Bala R May 26 '11 at 03:22
  • Good question, I love bithacks. – Robert Massaioli May 26 '11 at 03:48
  • related/possible duplicate of [How efficient is an if statement compared to a test that doesn't use an if? (C++)](http://stackoverflow.com/questions/3009699/how-efficient-is-an-if-statement-compared-to-a-test-that-doesnt-use-an-if-c) – Kirill V. Lyadvinsky May 26 '11 at 05:05

3 Answers3

14

Apart from all the twiddling based branchless code (which won't cover everything, such as FP), you get instructions specifically meant for branchless code creation, these would be SETcc, FCMOVcc and CMOVcc under x86, which perform operations based on the condition flags from a comparison.

a really simple example would be (yes, the example is so simple that one would probably never write something like this, its only to demonstrated a point clearly):

bool CheckZero(int x)
{
    if(x == 0)
       return true;

    return false;
    //OR we can use: return (x == 0) ? true : false; 
}

now a simple x86 compiler might compile that down to:

    MOV EAX,[ESP + 4]
    TEXT EAX,EAX
    JE _TRUE
    XOR EAX,EAX
    RETN 4

_TRUE:
    MOV EAX,1
    RETN 4

an optimizing x86 compiler would take it down into the following branchless code (or similar):

MOV ECX,[ESP + 4]
XOR EAX,EAX
TEST ECX,ECX
SETE AL
RETN 4

A little more complex example can be found here.

However, this is something that a compiler will perform, and not some you should worry about yourself (at least not without analyzing the output of your compiler). but if the code is required to be branchless without fail, C++ doesn't provide enough control to do so, so you need to use (inline) assembly.

Necrolis
  • 25,836
  • 3
  • 63
  • 101
  • 5
    What would ever possess you to write `return /* boolean condition */ ? true : false;` instead of `return /* boolean condition */;` ? Seriously, `return x == 0;` works fine and doesn't need even an optimizing compiler to be branchless. – Chris Lutz May 26 '11 at 05:19
  • 4
    @chris: i wrote that for clarity and to show a point, nothing more. Also, technically `return (x == 0);` is not automatically branchless, just cause there is no `if` or ternary selector, it depends on the compiler optimizations. – Necrolis May 26 '11 at 09:28
  • 2
    It is branchless - the value of `x == 0` is a `bool` which is just returned from the function as-is, so it's already optimal. – molbdnilo May 26 '11 at 10:53
  • 11
    @molbdnilo: Just cause the return type of the expression is a `bool` doesn't mean its evaluation will be branchless(its not a constant expression). your confusing language standards with the actual implementation at machine level(where it matters), the compiler may emit a branch, a `SETcc` based version or a twiddling based version, depends on the optimizations applied. – Necrolis May 26 '11 at 13:41
9

I had written ternary logic simulator not so long ago, and this question was viable to me, as it directly affects my interpretator execution speed; I was required to simulate tons and tons of ternary logic gates as fast as possible.

In a binary-coded-ternary system one trit is packed in two bits. Most significant bit means negative and least significant means positive one. Case "11" should not occur, but it must be handled properly and threated as 0.

Consider inline int bct_decoder( unsigned bctData ) function, which should return our formatted trit as regular integer -1, 0 or 1; As i observed there are 4 approaches: i called them "cond", "mod", "math" and "lut"; Lets investigate them

First is based on jz|jnz and jl|jb conditional jumps, thus cond. Its performance is not good at all, because relies on a branch predictor. And even worse - it varies, because it is unknown if there will be one branch or two a priori. And here is an example:

inline int bct_decoder_cond( unsigned bctData ) {
   unsigned lsB = bctData & 1;
   unsigned msB = bctData >> 1;
   return
       ( lsB == msB ) ? 0 : // most possible -> make zero fastest branch
       ( lsB > msB ) ? 1 : -1;
}

This is slowest version, it could involve 2 branches in worst case and this is something where binary logic fails. On my 3770k it prodices around 200MIPS on average on random data. (here and after - each test is average from 1000 tries on randomly filled 2mb dataset)

Next one relies on modulo operator and its speed is somewhere in between first and third, but is definetely faster - 600 MIPS:

inline int bct_decoder_mod( unsigned bctData ) {
    return ( int )( ( bctData + 1 ) % 3 ) - 1;
}

Next one is branchless approach, which involves only maths, thus math; it does not assume jump instrunctions at all:

inline int bct_decoder_math( unsigned bctData ) {
    return ( int )( bctData & 1 ) - ( int )( bctData >> 1 );
}

This does what is should, and behaves really great. To compare, performance estimate is 1000 MIPS, and it is 5x faster than branched version. Probably branched version is slowed down due to lack of native 2-bit signed int support. But in my application it is quite good version in itself.

If this is not enough then we can go futher, having something special. Next is called lookup table approach:

inline int bct_decoder_lut( unsigned bctData ) {
    static const int decoderLUT[] = { 0, 1, -1, 0 };
    return decoderLUT[ bctData & 0x3 ];
}

In my case one trit occupied only 2 bits, so lut table was only 2b*4 = 8 bytes, and was worth trying. It fits in cache and works blazing fast at 1400-1600 MIPS, here is where my measurement accuracy is going down. And that is is 1.5x speedup from fast math approach. That's because you just have precalculated result and single AND instruction. Sadly caches are small and (if your index length is greater than several bits) you simply cannot use it.

So i think i answered your question, on what what could branched/branchless code be like. Answer is much better and with detailed samples, real world application and real performance measurements results.

xakepp35
  • 2,878
  • 7
  • 26
  • 54
5

http://hbfs.wordpress.com/2008/08/05/branchless-equivalents-of-simple-functions/

I think (though I don't know more than what I read on the page) it is a way of getting if functionality without the branching (which makes sense based on the words branchless if ;)). Don't know more.

Thank Mr. Google.

soandos
  • 4,978
  • 13
  • 62
  • 96
  • Yeah, my Google-fu is weak tonight. – MPelletier May 26 '11 at 03:26
  • I don't know this for a fact, but I assume (actually hope is more like it) that compilers can do this optimization on their own. I have no idea if that is the case though. – soandos May 26 '11 at 03:27