
OK, it may sound a bit complicated, but this is what I'm trying to do:

  • Take e.g. 10101010101
  • And return { 0, 2, 4, 6, 8, 10 } - an array with all of the positions of bits which are set

This is my code:

UINT DQBitboard::firstBit(U64 bitboard)
{
    static const int index64[64] = {
    63,  0, 58,  1, 59, 47, 53,  2,
    60, 39, 48, 27, 54, 33, 42,  3,
    61, 51, 37, 40, 49, 18, 28, 20,
    55, 30, 34, 11, 43, 14, 22,  4,
    62, 57, 46, 52, 38, 26, 32, 41,
    50, 36, 17, 19, 29, 10, 13, 21,
    56, 45, 25, 31, 35, 16,  9, 12,
    44, 24, 15,  8, 23,  7,  6,  5  };

    static const U64 debruijn64 = 0x07EDD5E59A4E28C2ULL;

    #pragma warning (disable: 4146)
    return index64[((bitboard & -bitboard) * debruijn64) >> 58];  
}

vector<UINT> DQBitboard::bits(U64 bitboard)
{
    vector<UINT> res;

    while (bitboard)
    {
        UINT first = DQBitboard::firstBit(bitboard);
        res.push_back(first);

        bitboard &= ~(1ULL<<first);
    }

    return res;
}

And the code surely does work.

My point is:

  • Is there any faster implementation you have in mind?
  • Do you notice anything that could be optimized? If so, what?

Hints:

  • UINT is a typedef of unsigned int
  • U64 is a typedef of unsigned long long
  • Both methods are static inline.
Dr.Kameleon
    @K-ballo Chess Engine programming and using bitboards (64-bit unsigned integers) for Board representation and Move generation. :-) – Dr.Kameleon Dec 29 '12 at 23:34
  • Your find first set bit looks good, but I'm sure we could come up with a better way to get ALL the bits in a word. For example what Luchian describes. – Mats Petersson Dec 29 '12 at 23:34
    Is your example correct - you seem to be returning the odd bits, where the ones are at even bit positions 0, 2, 4 ... – Mats Petersson Dec 29 '12 at 23:43
  • How many bits do you expect to be set? There is probably a threshold value, below which the lookup table approach is faster, and above it naive iteration is faster. – hyde Dec 30 '12 at 00:08
  • @hyde 0 to 27, but most often even less than 8. – Dr.Kameleon Dec 30 '12 at 00:10
  • @MatsPetersson Nope, you were right, just noticed it. – Dr.Kameleon Dec 30 '12 at 00:27
    Check out http://en.wikipedia.org/wiki/Find_first_set – brian beuning Dec 30 '12 at 01:36
  • See also @Andrew asking http://stackoverflow.com/questions/20713017/what-is-the-fastest-way-to-return-the-positions-of-all-set-bits-in-a-64-bit-inte/20719768# – 9mjb Dec 24 '13 at 17:10

8 Answers


Here is another suggestion that can be profiled (can be combined with other suggestions for further optimization). Note, the loop here is O(number of set bits).

vector<UINT> bits_set (UINT64 data) 
{
    UINT n;
    vector<UINT> res;
    res.reserve(64);
    for (n = 0; data != 0; n++, data &= (data - 1))
    {
        res.push_back(log2(data & ~(data-1)));
    }
    return res;
}
SomeWittyUsername
  • Whoa!!! Yep, **that** did the trick. FYI, around 20% speed gain compared to my version... Impressed. – Dr.Kameleon Dec 30 '12 at 00:08
  • I'm going to pick this one as an answer - since it's by far the fastest for my case. However, I'm still open to new suggestions. Thanks again, buddy! :-) – Dr.Kameleon Dec 30 '12 at 00:12
  • Is `log2` the debruijn64 lookup from the question? Or some faster implementation? – Ben Voigt Dec 30 '12 at 00:13
  • @BenVoigt It's the binary logarithm function (Logarithm with base 2). – Dr.Kameleon Dec 30 '12 at 00:15
    @Dr.Kameleon: I know what the behavior is, I don't know what the implementation is (that will make or break the performance of this method). – Ben Voigt Dec 30 '12 at 00:29
    @BenVoigt That's a good question. It's for the user to decide what will the implementation of `log2` be and if it's good enough - the algorithm doesn't restrict to any particular implementation; that's why I wrote "*this suggestion can be profiled*". Also should be noted that the algorithm's performance is directly coupled with the amount of set bits in the input and for all bits set it won't do any better than a simple iteration – SomeWittyUsername Dec 30 '12 at 06:18

Bit shifts are really cheap. Lookup tables cost cache space, and your lookup also involves an integer multiplication. I expect plain brute force to be faster than the clever techniques.

vector<UINT> DQBitboard::bits(U64 bitboard)
{
    vector<UINT> res;
    res.reserve(64);
    uint_fast8_t pos = 0;   // start at bit 0, so positions match the question's 0-based example

    do {
        if (bitboard & 1) res.push_back(pos);
        ++pos;
    } while (bitboard >>= 1);

    return res;
}

You can unroll the loop a little, which may make it faster still.
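For instance, an unrolled variant might test four bits per pass (a sketch, not benchmarked; `bits_unrolled` is an illustrative name):

```cpp
#include <cstdint>
#include <vector>

// Sketch: unrolled scan, testing four bits per iteration.
std::vector<unsigned> bits_unrolled(uint64_t bitboard)
{
    std::vector<unsigned> res;
    res.reserve(64);
    unsigned pos = 0;
    while (bitboard) {
        if (bitboard & 1) res.push_back(pos);
        if (bitboard & 2) res.push_back(pos + 1);
        if (bitboard & 4) res.push_back(pos + 2);
        if (bitboard & 8) res.push_back(pos + 3);
        pos += 4;
        bitboard >>= 4;
    }
    return res;
}
```

Whether this actually helps depends on the compiler; modern optimizers may do similar unrolling on their own.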

The std::vector is the most expensive part by far. Consider using the bitboard directly. For example:

struct bitboard_end_iterator{};
struct bitboard_iterator
{
    U64 value;
    uint_fast8_t pos;

    bitboard_iterator(U64 bitboard) : value(bitboard), pos(0)
    {
        advance();                          // move to the first set bit
    }
    UINT operator*() const { return pos; }  // 0-based bit index
    bool operator==( bitboard_end_iterator ) const { return pos == 64; }
    operator bool() const { return pos < 64; }
    bitboard_iterator& operator++()
    {
        value >>= 1;
        ++pos;
        advance();
        return *this;
    }

private:
    void advance()                          // skip zero bits; pos == 64 means exhausted
    {
        while (value && !(value & 1)) {
            value >>= 1;
            ++pos;
        }
        if (!value) pos = 64;
    }
};

Now you can write

for( bitboard_iterator it(bitboard); it; ++it )
    cout << *it;

and I think you'll get your list of bits.

Version 2:

class bitboard_fast_iterator
{
    U64 value;
    uint_fast8_t pos;

public:
    bitboard_fast_iterator(U64 bitboard = 0) : value(bitboard),
        pos(bitboard ? __builtin_ctzll(bitboard) : 64) {}   // ctzll is undefined for 0
    UINT operator*() const { return pos; }                  // 0-based bit index
    operator bool() const { return value != 0; }
    bitboard_fast_iterator& operator++()
    {
        value &= ~(1ULL << pos);                            // clear the bit just visited
        pos = value ? __builtin_ctzll(value) : 64;
        return *this;
    }
};
Ben Voigt
  • Thanks a lot for all the effort. Just tested your version too, and you know what : surprisingly your version (which definitely looks fine and promising) is approximately 10% slower than mine... lol. I don't know why - probably it's because those U64 numbers usually have very few bits set? – Dr.Kameleon Dec 29 '12 at 23:48
  • @Dr.Kameleon: Did you benchmark it by itself, or in your complete program? The cost of the cache space won't show up in a microbenchmark, only when there's cache pressure. – Ben Voigt Dec 29 '12 at 23:51
  • This function is part of a much larger Chess Engine project of mine, and I've also written a small benchmarking suite attached to it, so that I can measure time/performance/etc of critical functions performed some millions of times... – Dr.Kameleon Dec 29 '12 at 23:53
  • @Dr.Kameleon: Right. But are you benchmarking the loops that run in your actual chess engine, or a loop containing just this function? In the microbenchmark the lookup table will easily fit in cache. In your real code, the lookup table might be evicted in favor of other data you're working on. – Ben Voigt Dec 29 '12 at 23:55
  • The Benchmark actually measures and performs the most critical part of the Engine : Move Generation. So, in a way, everything is being tested as part of the final program, not some testing loop... – Dr.Kameleon Dec 29 '12 at 23:58
  • @Dr.Kameleon: Ok, that's the right way to do it. I'd get rid of `std::vector` if you can. I think you could make a `bitboard_iterator` object that works with all your vector algorithms, but much faster. – Ben Voigt Dec 30 '12 at 00:00
  • Thanks a lot for the suggestion. I'll give it a try. Perhaps, the difference in speed with my version and all the others is what @MatsPetersson has guessed : my usual bitboards contain very few 1's, compared to 0's... Anyway, thanks again! :-) – Dr.Kameleon Dec 30 '12 at 00:04
    @Dr.Kameleon: Have a look and see if that iterator works for you? Naturally you can do it with the DeBruijn lookup instead as well. – Ben Voigt Dec 30 '12 at 00:11
  • +1 I like the iterator approach and use this myself in my own board game engine. For good meausre, I wrap the `U64` in a class, and define `begin()` and `end()` iterators. This allows me to write `std::copy(begin(bitboard), end(bitboard), std::back_insert(res));` The iterator's `operator++` and `operator*` are implemented using the same technique as the g++ `Find_first` and `Find_next` extension to `std::bitset` (and use `__builtin_ctzll` to locate the first/next bit) – TemplateRex Jan 10 '14 at 23:16

I have been wondering whether it would work faster with the bsf assembly instruction. So I tried 3 implementations and got the following results for 10 million iterations:

Your implementation (Dr.Kameleon) 1.77 seconds

The log2() implementation (icepack) 2.17 seconds

My assembly implementation (me) 0.16 seconds

Output:

bits version:
Function started at 0
           ended at 177
              spent 177 (1.770000 seconds)
c version:
Function started at 177
           ended at 394
              spent 217 (2.170000 seconds)
asm version:
Function started at 394
           ended at 410
              spent 16 (0.160000 seconds)

One point about C/C++: the function-local static here is horrendous. It is actually compiled into a list of CPU instructions (NOT what I would expect either!!!). Instead, put the array outside your function in a nameless namespace; that has the expected effect. In assembly you can use .long (or some other size directive) and %rip-relative addressing to reference the data.

Note: n and bitboard are modified inside the asm block, so they must be read-write ("+r") operands; if n is listed only as an input, the compiler may assume it is unchanged and the size of the returned array becomes invalid. Outside of that, the code itself becomes a loop of 5 assembly instructions, hence the large (about x10) speedup.

The reason for log2()'s slowness is that it moves the number into an xmm register, calls another function, and then moves the result from the xmm register back into a general-purpose register.

#include <stdlib.h>
#include <stdio.h>
#include <inttypes.h>
#include <unistd.h>
#include <sys/times.h>
#include <string.h>
#include <math.h>
#include <vector>

using namespace std;

namespace
{
const int index64[64] = {
    63,  0, 58,  1, 59, 47, 53,  2,
    60, 39, 48, 27, 54, 33, 42,  3,
    61, 51, 37, 40, 49, 18, 28, 20,
    55, 30, 34, 11, 43, 14, 22,  4,
    62, 57, 46, 52, 38, 26, 32, 41,
    50, 36, 17, 19, 29, 10, 13, 21,
    56, 45, 25, 31, 35, 16,  9, 12,
    44, 24, 15,  8, 23,  7,  6,  5  };
const uint64_t debruijn64 = 0x07EDD5E59A4E28C2ULL;
}

int firstBit(uint64_t bitboard)
{
    return index64[((bitboard & -bitboard) * debruijn64) >> 58];  
}

vector<int> bits(uint64_t bitboard)
{
    vector<int> res;
    res.reserve(64);

    while(bitboard)
    {
        int first = firstBit(bitboard);
        res.push_back(first);

        bitboard &= ~(1ULL << first);
    }
    return res;
}



vector<int> bits_c(uint64_t bitboard)
{
    int n;
    vector<int> res;
    res.reserve(64);
    for (n = 0; bitboard != 0; n++, bitboard &= (bitboard - 1))
    {
        res.push_back(log2(bitboard & ~(bitboard - 1)));
    }
    return res;
}


vector<int> bits_asm(uint64_t bitboard)
{
    int64_t n(0);
    int res[64];
    asm(
    "bsf %[b], %%rax\n\t"
    "je 2f\n\t"
    ".align 16\n"
"1:\n\t"
    "mov %%eax, (%[r],%[n],4)\n\t"
    "btr %%rax, %[b]\n\t"
    "inc %[n]\n\t"
    "bsf %[b], %%rax\n\t"
    "jne 1b\n"
"2:\n\t"
    : /* output: n and bitboard are modified */ [n] "+r" (n), [b] "+r" (bitboard)
    : /* input */ [r] "r" (res)
    : /* clobbers */ "rax", "cc", "memory"
    );
    return vector<int>(res, res + n);
}




class run_timer
{
public:
    run_timer()
    {
    }

    void start()
    {
        times(&f_start);
    }

    void stop()
    {
        times(&f_stop);
    }

    void report(const char *msg)
    {
        printf("%s:\n"
               "Function started at %ld\n"
               "           ended at %ld\n"
               "              spent %ld (%f seconds)\n",
               msg,
               f_start.tms_utime,
               f_stop.tms_utime,
               f_stop.tms_utime - f_start.tms_utime,
               (double)(f_stop.tms_utime - f_start.tms_utime)/(double)sysconf(_SC_CLK_TCK));
    }

    struct tms f_start;
    struct tms f_stop;
};


int main(int argc, char *argv[])
{
    run_timer t;

    t.start();
    for(int i(0); i < 10000000; ++i)
    {
        bits(rand());
    }
    t.stop();
    t.report("bits version");

    t.start();
    for(int i(0); i < 10000000; ++i)
    {
        bits_c(rand());
    }
    t.stop();
    t.report("c version");

    t.start();
    for(int i(0); i < 10000000; ++i)
    {
        bits_asm(rand());
    }
    t.stop();
    t.report("asm version");

    return 0;
}

Compiled with g++ with this command line:

c++ -msse4.2 -O2 -o bits bits.cpp

Although you may think that the -msse4.2 flag could be the problem for the log2() version, I tried without it and log2() was still slower.

Btw, I do not recommend this method since it is not portable: only x86-based computers understand these instructions.

Alexis Wilke

Replace your firstBit function with an intrinsic using the BSF or BSR instruction for a massive speedup.

In gcc and clang, those are __builtin_ffsll (1-based, returning 0 when no bit is set) and __builtin_ctzll (0-based, undefined for a zero argument).

With Visual C++, _BitScanForward64 and _BitScanReverse64.
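As a sketch of how the whole bits() loop could use the GCC builtin (`bits_ctz` is an illustrative name; `__builtin_ctzll` is undefined for a zero argument, which the loop condition guards against):

```cpp
#include <cstdint>
#include <vector>

// Sketch: collect set-bit positions using the ctz intrinsic (GCC/Clang).
std::vector<unsigned> bits_ctz(uint64_t bitboard)
{
    std::vector<unsigned> res;
    res.reserve(64);
    while (bitboard) {
        res.push_back(__builtin_ctzll(bitboard)); // 0-based index of lowest set bit
        bitboard &= bitboard - 1;                 // clear lowest set bit
    }
    return res;
}
```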

Ben Voigt
  • Hmm... Another really interesting suggestion I've though of myself. However there is an issue : I'm using a Mac and compiling with a clang++ compiler (the only one I got to fully support Boost and C++11). Any ideas? Is there an equivalent? – Dr.Kameleon Dec 30 '12 at 00:38
  • @Dr.Kameleon: clang supports these under the gcc names: http://comments.gmane.org/gmane.comp.compilers.clang.devel/2074 – Ben Voigt Dec 30 '12 at 01:03
    According to lore, the NSA will not buy computers without these instructions. – brian beuning Dec 30 '12 at 01:35
  • Another weird thing : just tested *my* version, using `__builtin_ffsll` for getting the first bit (instead of De Bruijn multiplication) - which has been tested and working with Clang - and was expecting it go fast. The thing is... it didn't. My version is about as fast as before, and @icepack's version remains the fastest. – Dr.Kameleon Dec 30 '12 at 09:27
  • Ok, that is really strange. – Ben Voigt Dec 30 '12 at 20:48

The fastest approach I can think of right now would be a pre-generated lookup table. It doesn't have to cover all 64-bit numbers: you can break the number into 8-bit or 16-bit parts, look each part up, and concatenate the returned arrays, adding the proper offsets to account for the actual positions of the bits.
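A rough sketch of the byte-wise table idea (illustrative names; a 256-entry table stores each byte value's set-bit positions, and each byte's offset within the word is added back):

```cpp
#include <cstdint>
#include <vector>

// Build a table mapping every 8-bit value to the positions of its set bits.
static std::vector<std::vector<unsigned>> make_byte_table()
{
    std::vector<std::vector<unsigned>> table(256);
    for (unsigned v = 0; v < 256; ++v)
        for (unsigned b = 0; b < 8; ++b)
            if (v & (1u << b))
                table[v].push_back(b);
    return table;
}

// Sketch: scan the 64-bit word one byte at a time via the table.
std::vector<unsigned> bits_bytewise(uint64_t bitboard)
{
    static const std::vector<std::vector<unsigned>> table = make_byte_table();
    std::vector<unsigned> res;
    res.reserve(64);
    for (unsigned byte = 0; byte < 8; ++byte)
        for (unsigned b : table[(bitboard >> (8 * byte)) & 0xFF])
            res.push_back(8 * byte + b);   // re-add the byte's offset in the word
    return res;
}
```

Whether this beats the bit-twiddling versions is exactly the cache-pressure question raised in the comments; it would need profiling in the real engine.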

Luchian Grigore
  • I can't imagine that concatenation is actually faster than the simple bit-twiddling. Even the cache hit for the table lookup is going to be slower. – Ben Voigt Dec 29 '12 at 23:38
  • @BenVoigt yeah, I figured the map wasn't a good suggestion, but it can also be done with an array. – Luchian Grigore Dec 29 '12 at 23:39
  • A lookup table is still likely to have worse performance, because of the cache space it takes up. Bit shifts are cheap. – Ben Voigt Dec 29 '12 at 23:40
const size_t size = sizeof(U64)*8;
U64 val = 1;

vector<UINT> res;
res.reserve(size);

for ( size_t i = 0; i < size; ++i ) {   // 0-based positions, low bit first
  if ( ( val & bitboard ) != 0 ) {
    res.push_back(i);
  }
  val <<= 1;
}
Naszta

I tried a naive version, which clocked in around 2-3x faster, but I had reserve()'d the vector first. On applying the reserve to the original algorithm, it beat the naive one.

So I suspect that the vector operations are the bigger cost here, rather than the bit manipulation (or even the multiply used in the next-bit function).

There are some other speed-ups to be had, regarding finding the lowest set bit. My local log2 version was poor, and the function given in the original post is not super cheap.

Here's my best attempt:

void bits(uint64 bitboard, vector<int> &res)
{
    res.resize(64);
    int i = 0;
    while (bitboard)
    {
        int first;
        _BitScanForward64((DWORD *)&first, bitboard);
        res[i++] = first;
        bitboard &= ~(1ULL<<first);
    }
    res.resize(i);
}

Replaced the firstBit function with an asm intrinsic. Using the intrinsic gave a big boost here. (This is obviously not portable code, though I suspect a GCC variant shouldn't be too tricky).

Also assumes the vector is reasonably persistent, rather than being dynamically allocated/copied all the time, and just resizes it appropriately.

JasonD
  • I tried "reserving" the vector, in almost all versions and the effect is unnoticeable if any... - what would you suggest if we put the vectors aside? – Dr.Kameleon Dec 29 '12 at 23:51
  • Interesting. Makes a big difference here. – JasonD Dec 29 '12 at 23:52
  • Getting rid of the vectors completely, and doing bitboard iteration whenever you would iterate the vector, is probably a much bigger win. – Ben Voigt Dec 29 '12 at 23:58
  • Yeah, I can take a reasonable chunk out just by moving to an old fashioned array, rather than a vector. – JasonD Dec 30 '12 at 00:01

I actually think the fastest, simplest method is simply to loop around, but if we pass in a vector instead of copying one on return, it should be a little faster.

void DQBitboard::bits(U64 bitboard, vector<UINT> &res)
{
    res.clear();   // Make sure vector is empty. Not necessary if caller does this!
    int bit = 0;
    while (bitboard)
    {
        if (bitboard & 1) 
            res.push_back(bit);
        bit++;
        bitboard >>= 1;
    }

}

The multiply in the findfirst will cost a bit, so I doubt it's really worth it.

Mats Petersson