8

I have some critical branching code inside a loop that's run about 2^26 times. Branch prediction is not optimal because m is random. How would I remove the branching, possibly using bitwise operators?

bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
if(a == 0)
    a = (m ? (a+1) : (k));
else if(a == k)
    a = (m ?     0 : (a-1));
else
    a = (m ? (a+1) : (a-1));

And here is the relevant assembly generated by gcc -O3:

.cfi_startproc
movl    4(%esp), %edx
movb    8(%esp), %cl
movl    (%edx), %eax
testl   %eax, %eax
jne L15
cmpb    $1, %cl
sbbl    %eax, %eax
andl    $638, %eax
incl    %eax
movl    %eax, (%edx)
ret
L15:
cmpl    $639, %eax
je  L23
testb   %cl, %cl
jne L24
decl    %eax
movl    %eax, (%edx)
ret
L23:
cmpb    $1, %cl
sbbl    %eax, %eax
andl    $638, %eax
movl    %eax, (%edx)
ret
L24:
incl    %eax
movl    %eax, (%edx)
ret
.cfi_endproc
scientiaesthete
    Before you jump into this, you might want to make sure the compilers you care about aren't actually able to optimize these into conditional moves. – Mysticial Aug 19 '12 at 21:09
  • Is 0<=a<=k a valid assumption? – Eric Aug 19 '12 at 21:13
  • @Eric Yes, 0<=a<=k is always true. – scientiaesthete Aug 19 '12 at 21:16
  • I'm quite surprised that GCC isn't able to turn the inner branches into conditional moves. The ones you have here are a lot simpler than the one in [this question](http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array) where `-O3` does the job. – Mysticial Aug 19 '12 at 21:35
  • Looks like it's basically `a = a + (m?1:-1) % k`; One branch which a two-element LUT could fix. Or have I got negative modulo wrong...? – Roddy Aug 19 '12 at 22:10

6 Answers

4

The fastest I've found is now the table implementation.

Timings I got (UPDATED for the new measurement code below):

HVD's most recent: 9.2s

Table version: 7.4s (with k = 639)

Table creation code:

    // a ranges over 0..k, so we need k+1 pairs of entries
    unsigned int table[2*(k+1)];
    unsigned int *table_ptr = table;
    // f here is the original branching update from the question,
    // used only once to fill the table
    for(unsigned int i = 0; i <= k; i++){
      unsigned int a = i;
      f(0, a);
      table[i<<1] = a;

      a = i;
      f(1, a);
      table[(i<<1) + 1] = a;  // parentheses needed: i<<1 + 1 would parse as i << 2
    }

Table runtime loop:

void f(bool m, unsigned int &a){
  a = table_ptr[a<<1 | m];
}
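
Putting the two snippets together, here is a minimal self-contained sketch of how they fit: the question's original branching update (called step_branching below, an illustrative name) fills the table once, and only the branch-free lookup runs in the hot loop. k = 639 is assumed from the question's assembly.

    #include <cassert>

    const unsigned int k = 639;                // assumed from the question's assembly
    unsigned int table[2 * (k + 1)];           // a ranges over 0..k, two entries per value
    unsigned int *table_ptr = table;

    // The question's original branching update; used only to fill the table.
    void step_branching(bool m, unsigned int &a)
    {
        if (a == 0)       a = m ? a + 1 : k;
        else if (a == k)  a = m ? 0 : a - 1;
        else              a = m ? a + 1 : a - 1;
    }

    // Branch-free lookup used in the hot loop.
    void f(bool m, unsigned int &a)
    {
        a = table_ptr[(a << 1) | m];
    }

    int main()
    {
        for (unsigned int i = 0; i <= k; i++) {
            unsigned int a = i;
            step_branching(false, a);
            table[i << 1] = a;

            a = i;
            step_branching(true, a);
            table[(i << 1) + 1] = a;
        }

        unsigned int a = 0;
        f(false, a); assert(a == k);           // 0 wraps down to k, as in the original code
        f(true,  a); assert(a == 0);           // k wraps up to 0
        return 0;
    }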

With HVD's measurement code, I saw the cost of rand() dominating the runtime, so that a branchless version landed in about the same range as these solutions. I changed the measurement code to the following (UPDATED to keep the random branch order, pre-computing the random values so that rand() etc. can't trash the cache):

#include <cstdlib>   // for rand()

int main(){
  unsigned int a = k / 2;
  int m[100000];
  for(int i = 0; i < 100000; i++){
    m[i] = rand() & 1;
  }

  for (int i = 0; i != 10000; i++)
  {
    for(int j = 0; j != 100000; j++){
      f(m[j], a);
    }
  }
}
Eric
  • Nice! I created a test program and measured: the original version takes ~4.8s, my branch-free version takes ~5.3s so it doesn't buy anything (which I had noted in my now-deleted answer), and yours takes ~4.2s. –  Aug 19 '12 at 22:19
  • Wow. The branch-free division-free modulo is *still* worse than branching. See my new answer. –  Aug 19 '12 at 22:32
  • I'm not seeing the benefits you get in your latest version. Yes, `rand()` will take up most of the time, but the time it's taking should be constant, so that doesn't matter; the absolute difference between the different versions is still a useful measurement. I get ~4.3s with that version. –  Aug 19 '12 at 23:21
  • Either way, I just checked and think a table-lookup significantly beats both of our solutions. – Eric Aug 19 '12 at 23:33
  • I already included that in my tests, and it doesn't (probably because it accesses memory in an unpredictable order). –  Aug 19 '12 at 23:38
  • Try it without the call to rand(), which is possibly knocking parts of the table out of cache each time. – Eric Aug 19 '12 at 23:40
  • Will do so later, but I would like to leave the unpredictable switches between `f(false, ...)` and `f(true, ...)`. In your version, you call one or the other 100000 times in a row, which could make a huge difference, especially if `f` gets inlined. –  Aug 19 '12 at 23:50
  • Updated to maintain random branching order. – Eric Aug 20 '12 at 00:38
  • More bad news for testing: in your test code, I get nothing at all: gcc has inlined `f`, determined that the entire loop does nothing, and removed it. –  Aug 20 '12 at 06:15
  • But if I change my test program to `while (a != 0) f(m[++j] & 1, a);`, where `j` is `unsigned short` and `m` is `int[65536]` filled with `rand()` results, table lookup is still slower in my testing. –  Aug 20 '12 at 06:23
4

The branch-free division-free modulo could have been useful, but testing shows that in practice, it isn't.

const unsigned int k = 639;
void f(bool m, unsigned int &a)
{
    a += m * 2 - 1;         // +1 when m is true, -1 (wrapping) when m is false
    if (a == -1u)           // a was 0 and we decremented: wrap around to k
        a = k;
    else if (a == k + 1)    // a was k and we incremented: wrap around to 0
        a = 0;
}

Testcase:

unsigned a = 0;
f(false, a);
assert(a == 639);
f(false, a);
assert(a == 638);
f(true, a);
assert(a == 639);
f(true, a);
assert(a == 0);
f(true, a);
assert(a == 1);
f(false, a);
assert(a == 0);

Actually timing this, using a test program:

#include <cstdlib>   // for rand()

int main()
{
    for (int i = 0; i != 10000; i++)
    {
        unsigned int a = k / 2;
        while (a != 0) f(rand() & 1, a);
    }
}

(Note: there's no srand, so results are deterministic.)

My original answer: 5.3s

The code in the question: 4.8s

Lookup table: 4.5s (static unsigned lookup[2][k+1];)

Lookup table: 4.3s (static unsigned lookup[k+1][2];)

Eric's answer: 4.2s

This version: 4.0s

  • Wow! Your version is indeed faster than mine and others, it shaved about 7 nanosecs off of 30 nanosecs / loop. Great improvement! Thanks for all your effort! – scientiaesthete Aug 19 '12 at 23:02
1

I don't think you can remove the branches entirely, but you can reduce the number by branching on m first.

if (m){
    if (a==k) {a = 0;} else {++a;}
}
else {
    if (a==0) {a = k;} else {--a;}
}
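
As Mysticial's comment under the question suggests, compilers can often turn simple conditionals like these into conditional moves. A minimal sketch along those lines, with each arm written as a single conditional expression (reusing the f(bool, unsigned int &) signature from the other answers and assuming k = 639 from the question's assembly):

    const unsigned int k = 639;   // assumed; matches the constants in the assembly

    // Same update, with each arm as one conditional expression, which
    // compilers can often lower to a conditional move (cmov).
    void f(bool m, unsigned int &a)
    {
        if (m)
            a = (a == k) ? 0u : a + 1;
        else
            a = (a == 0u) ? k : a - 1;
    }

    int main()
    {
        unsigned int a = k / 2;
        f(true, a);
        f(false, a);
        return static_cast<int>(a & 0xffu);   // use the result so the calls aren't optimized away
    }
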
Antimony
1

Adding to Antimony's rewrite:

if (a==k) {a = 0;} else {++a;}

looks like an increase with wraparound. You can write this as

a=(a+1)%k;

which, of course, only makes sense if divisions are actually faster than branches.

Not sure about the other one; too lazy to think about what the (~0)%k will be.

Christian Stieber
  • +1, gcc optimises divisions by constants very well (replacing them with multiplications that rely on integer overflow), but it should be `% (k + 1)`. For the subtraction, just add another `k + 1` before subtracting 1 and getting the remainder: `a = (a + k) % (k + 1);` –  Aug 19 '12 at 21:57
  • I've expanded on your answer in my own. –  Aug 19 '12 at 22:02
1

This has no branches. Because k is constant, the compiler might be able to optimize the modulo depending on its value. And if k is 'small', then a full lookup table solution would probably be even faster.

bool m;
unsigned int a;
const unsigned int k = ...; // k >= 7
const unsigned int inc[2] = {k, 1}; // m == false: add k, which acts as -1 modulo k+1; m == true: add 1

a = (a + inc[m]) % (k+1);           // parentheses matter: % binds tighter than +
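
A quick sanity check of this expression against the question's update, assuming k = 639 as in the question's assembly (step here is just an illustrative wrapper):

    #include <cassert>

    const unsigned int k = 639;                 // assumed from the question's assembly
    const unsigned int inc[2] = {k, 1};         // false: add k (i.e. -1 mod k+1); true: add 1

    unsigned int step(bool m, unsigned int a)
    {
        return (a + inc[m]) % (k + 1);
    }

    int main()
    {
        assert(step(true,  0) == 1);            // ordinary increment
        assert(step(false, 0) == k);            // 0 wraps down to k
        assert(step(true,  k) == 0);            // k wraps up to 0
        assert(step(false, k) == k - 1);        // ordinary decrement
        return 0;
    }
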
Roddy
1

If k isn't large enough to cause overflow, you could do something like this:

int a; // Note: not unsigned int
int plusMinus = 2 * m - 1;
a += plusMinus;
if(a == -1) 
    a = k; 
else if (a == k+1) 
    a = 0; 

There are still branches, but branch prediction should fare much better, since the edge-condition branches are taken far more rarely than a branch on m would be.

Michael