
Do you think there is room for optimizations in the function haswon (see below)?

I noticed that changing the argument type from __int64 to unsigned __int64 made the function faster, so I thought there might still be room for optimization.

In more detail: I am writing a Connect Four game. I recently profiled it with Very Sleepy and noticed that the function haswon consumes a large share of the CPU time. The function uses a bitboard representation of the Connect Four board for one player; I found it in the sources of the fourstones benchmark. The bitboard layout is as follows:

.  .  .  .  .  .  .  TOP
5 12 19 26 33 40 47
4 11 18 25 32 39 46
3 10 17 24 31 38 45
2  9 16 23 30 37 44
1  8 15 22 29 36 43
0  7 14 21 28 35 42  BOTTOM

The function:

// return whether newboard includes a win
bool haswon(unsigned __int64 newboard)
{
    unsigned __int64 y = newboard & (newboard >> 6);
    if (y & (y >> 2 * 6)) // check \ diagonal
        return true;
    y = newboard & (newboard >> 7);
    if (y & (y >> 2 * 7)) // check horizontal -
        return true;
    y = newboard & (newboard >> 8);
    if (y & (y >> 2 * 8)) // check / diagonal
        return true;
    y = newboard & (newboard >> 1);
    if (y & (y >> 2))     // check vertical |
        return true;
    return false;
}

Thanks!

Edit: CPU is x86, 32-bit architecture; I'm using the compiler from Visual Studio 2008 Express Edition. Optimization flags are /O2 /Oi /GL.

I tried the function haswon2 that Ben Jackson suggested. The assembly generated by the Microsoft compiler with the default release optimization flags (/O2 /Oi /GL) shows nearly no runtime difference. It looks like the VC++ compiler, unlike gcc, cannot take advantage of the fact that it need not evaluate each condition in strict order.

Results (profiler screenshots): haswon original vs. haswon2 from Ben Jackson.

Edit2: Assembly of haswon:

00401A10  mov         eax,dword ptr [esp+4] 
00401A14  mov         ecx,dword ptr [esp+8] 
00401A18  push        ebx  
00401A19  push        esi  
00401A1A  push        edi  
00401A1B  mov         edx,eax 
00401A1D  mov         edi,ecx 
00401A1F  shrd        edx,edi,6 
00401A23  mov         esi,edx 
00401A25  shr         edi,6 
00401A28  and         esi,eax 
00401A2A  and         edi,ecx 
00401A2C  mov         edx,esi 
00401A2E  mov         ebx,edi 
00401A30  shrd        edx,ebx,0Ch 
00401A34  shr         ebx,0Ch 
00401A37  and         edx,esi 
00401A39  and         ebx,edi 
00401A3B  or          edx,ebx 
00401A3D  je          `anonymous namespace'::haswon+35h (401A45h) 
00401A3F  mov         al,1 
00401A41  pop         edi  
00401A42  pop         esi  
00401A43  pop         ebx  
00401A44  ret              
00401A45  mov         edx,eax 
00401A47  mov         edi,ecx 
00401A49  shrd        edx,edi,7 
00401A4D  mov         esi,edx 
00401A4F  shr         edi,7 
00401A52  and         esi,eax 
00401A54  and         edi,ecx 
00401A56  mov         edx,esi 
00401A58  mov         ebx,edi 
00401A5A  shrd        edx,ebx,0Eh 
00401A5E  shr         ebx,0Eh 
00401A61  and         edx,esi 
00401A63  and         ebx,edi 
00401A65  or          edx,ebx 
00401A67  jne         `anonymous namespace'::haswon+2Fh (401A3Fh) 
00401A69  mov         edx,eax 
00401A6B  mov         edi,ecx 
00401A6D  shrd        edx,edi,8 
00401A71  mov         esi,edx 
00401A73  shr         edi,8 
00401A76  and         esi,eax 
00401A78  and         edi,ecx 
00401A7A  mov         edx,esi 
00401A7C  mov         ebx,edi 
00401A7E  shrd        edx,ebx,10h 
00401A82  shr         ebx,10h 
00401A85  and         edx,esi 
00401A87  and         ebx,edi 
00401A89  or          edx,ebx 
00401A8B  jne         `anonymous namespace'::haswon+2Fh (401A3Fh) 
00401A8D  mov         edx,eax 
00401A8F  mov         esi,ecx 
00401A91  shrd        edx,esi,1 
00401A95  shr         esi,1 
00401A97  and         esi,ecx 
00401A99  and         edx,eax 
00401A9B  mov         eax,edx 
00401A9D  mov         ecx,esi 
00401A9F  shrd        eax,ecx,2 
00401AA3  shr         ecx,2 
00401AA6  and         eax,edx 
00401AA8  and         ecx,esi 
00401AAA  or          eax,ecx 
00401AAC  jne         `anonymous namespace'::haswon+2Fh (401A3Fh) 
00401AAE  pop         edi  
00401AAF  pop         esi  
00401AB0  xor         al,al 
00401AB2  pop         ebx  
00401AB3  ret    
Christian Ammer
    Surely this function runs once per move? Why does it matter if it takes 1 microsecond or 1 millisecond? – Oliver Charlesworth Nov 23 '10 at 21:54
  • This almost certainly doesn't need optimization. – Paul Nov 23 '10 at 21:55
  • 5
    The function is called by two other functions within a alpha-beta game-tree search. The other functions are 'getMoves' which tests for win or zugzwang, and 'evaluate' which tests if the board includes a win. The function is really called very often. – Christian Ammer Nov 23 '10 at 22:03
  • @Christian: I see. Well, I can't offer any specific advice about the function above, but have you considered taking advantage of the fact that after every move, if there is a line of 4, it *must* include the new piece? – Oliver Charlesworth Nov 23 '10 at 22:09
  • @Oli it could matter if it's used by AI. – ruslik Nov 23 '10 at 22:17
  • What CPU ? What compiler ? If x86 then is this a 32-bit executable or 64-bit ? Is a CPU-specific optimisation acceptable (e.g. use SIMD ?) ? – Paul R Nov 23 '10 at 22:24
  • @Oli, @Paul: in game programming, it is common to have this kind of code run zillions of times per move, when simulating possible countermoves, following moves, following countermoves, etc. – Fred Foo Nov 23 '10 at 22:39
  • It depends. Can you post the generated assembly? There are a couple of things in the code that *could* be improved, but most likely the compiler already does it, so there'd be nothing to gain by trying to optimize it yourself. – jalf Nov 23 '10 at 22:46
  • @Paul R: I posted the CPU and compiler information into the question. – Christian Ammer Nov 24 '10 at 21:39
  • Thanks for posting a followup. You should set up a benchmark that doesn't require function profiling: Function profilers add overhead to every call that causes inner functions to look disproportionately bad. You've probably already noticed your runtime is much longer with profiling on. – Ben Jackson Nov 24 '10 at 21:41
  • @Oli Charlesworth: Hmm, I'm not really sure I understand you correctly with 'it must include the new piece'. The function 'getMoves' returns 0 moves if the board includes a win. It returns 1 move (the winning move) if there is such one. It returns 1 move if there is a zugzwang situation. In all other cases it returns all moves of non-full columns. – Christian Ammer Nov 24 '10 at 21:47
  • @jalf: I posted the generated assembly into the question. – Christian Ammer Nov 24 '10 at 22:27

1 Answer

The idea behind this version is to avoid the strict testing order (the intermediate returns force the compiler to evaluate the conditions one at a time, in order) as well as the branching associated with multiple if statements:

// return whether newboard includes a win
bool haswon2(uint64_t newboard)
{
    uint64_t y = newboard & (newboard >> 6);
    uint64_t z = newboard & (newboard >> 7);
    uint64_t w = newboard & (newboard >> 8);
    uint64_t x = newboard & (newboard >> 1);
    return (y & (y >> 2 * 6)) | // check \ diagonal
           (z & (z >> 2 * 7)) | // check horizontal -
           (w & (w >> 2 * 8)) | // check / diagonal
           (x & (x >> 2));      // check vertical |
}

With a decent level of optimization you can really think of w, x, y and z as "aliases" for the shifted values. This means the final return statement throws the entire operation into a big soup for the compiler to play with. On my system this version takes only 65% of the runtime of the original (including the overhead of generating a random position each time). It might win by a larger percentage if the boards are mainly non-winning.

Looking at the disassembly of each (from gcc -O3) the original version is actually shorter, so it's likely the lack of branching in the tight inner loop that really helps.

Ben Jackson
  • Is there any reason why the compiler couldn't perform these optimizations from the original code? I can't really see any (there are no pointers or aliasing issues, no array lookups or side effects that might prohibit this kind of code reordering). So is this just a case of GCC not being good enough, or is there some aspect of the original code that means it *cannot* be transformed into code like yours automatically? – jalf Nov 23 '10 at 22:54
  • If the compiler could see that there were no side effects beyond the first condition and that all of the conditions merged into the same result (it actually does figure that part out) it seems like it could. Perhaps someone with a recent `clang` install could try that compiler? – Ben Jackson Nov 23 '10 at 22:58
  • 1
    @jalf Don't supraestimate it. It's not able to optimize an algorithm. Read this: http://www.agner.org/optimize/optimizing_cpp.pdf – ruslik Nov 23 '10 at 22:59
  • The compiler is not good enough. It would have to recognize that this sequence of statements performs a logical or, in a shortcut manner, that there are no side effects (so it is semantically equivalent to a bitwise or), and (and this will likely never happen) that it is faster not to use the shortcut (which actually hasn't been confirmed yet). – Martin v. Löwis Nov 23 '10 at 22:59
  • FWIW it's almost certainly not faster to use the shortcutting version for this particular algorithm because the vast majority of tests will be on non-winning boards (winning boards are leaves in the game tree). – Ben Jackson Nov 23 '10 at 23:03
  • It may actually be even better to change the return type to `uint64_t`, so that the compiler knows that it doesn't have to fold all non-zero values down to `1` (assuming that's a C99 `bool`). – caf Nov 23 '10 at 23:12
  • After any looping the dynamic branch prediction is going to saturate as "branch not predicted" and when inlined into an inner loop the tests are going to be fall-through except for the `return true` case anyway. I would also recommend testing when using such macros: I wrote a similarly dense bit testing inner loop and some `unlikely()` helped and some hurt. – Ben Jackson Nov 23 '10 at 23:14
  • @caf: In x86_64 that folds to one `setne %al` after then final `or`. It might go away in the inlined version when tested in `if(haswon...` but my test loop stored the result in a `volatile` to ensure it didn't optimize away. – Ben Jackson Nov 23 '10 at 23:16
  • @ruslik: I have read that before, and I'm not asking it to optimize an algorithm. I'm asking if there's any reason why it can't 1) identify that the code contains no side effects (this should be trivial in this case as it contains only built-in arithmetic integral operations), and therefore that eliminating the branches is safe, and 2) that eliminating the branches is going to be faster. What I wondered was simply if there was some aspect to the original code that made this optimization *impossible* for the optimizer to do safely. – jalf Nov 24 '10 at 01:05
  • @jalf I mentioned that book because there are lots of examples of trivial code that the compiler was unable to optimize for whatever reason. – ruslik Nov 24 '10 at 03:25
  • 1
    @Ben Jackson: I read your answer when i was in work and was happy about your efforts. It seems that the Microsoft Compiler can't take an advantage of your idea. I posted the results in the question. Thank you for your version. – Christian Ammer Nov 24 '10 at 21:36
  • 1
    I'm considerably surprised that the compiler cannot perform this optimization itself. – Puppy Nov 24 '10 at 23:05
  • 1
    @Christian: if the answer doesn't solve the problem satisfactorily, there's nothing wrong with "un-accepting" it. As long as there is an accepted answer, fewer people will *see* the answer, and so you'll get fewer answers. – jalf Nov 25 '10 at 02:19