counting bits across multiple integers... is there a faster way?

Question

I have an array of 4 longs that I am wanting to count the number of set bits within a given range. This is the function that I am currently using (where bitcount(uint64_t) is an inline asm function that gives the number of set bits in the argument):

unsigned count_bits(const uint64_t *long_ptr, size_t begin, size_t end)
{
    uint64_t start_mask = ~((1L << (begin & 63)) - 1);
    uint64_t end_mask = ((1L << (end & 63)) - 1);
    if (begin >= 0 && begin < 64) {
        if (end < 64) {
            return bitcount(long_ptr[0] & start_mask & end_mask);
        } else if (end < 128) {
            return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1] & end_mask);
        } else if (end < 192) {
            return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2] & end_mask);
        } else if (end<256) {
            return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2]) + bitcount(long_ptr[3] & end_mask);
        } else {
            return bitcount(long_ptr[0] & start_mask) + bitcount(long_ptr[1]) + bitcount(long_ptr[2]) + bitcount(long_ptr[3]);
        }
    } else if (begin >= 64 && begin < 128) {
        if (end < 128) {
            return bitcount(long_ptr[1] & start_mask & end_mask);
        } else if (end < 192) {
            return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2] & end_mask);
        } else if (end < 256) {
            return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2]) + bitcount(long_ptr[3] & end_mask);
        } else {
            return bitcount(long_ptr[1] & start_mask) + bitcount(long_ptr[2]) + bitcount(long_ptr[3]);
        }
    } else if (begin >= 128 && begin < 192) {
        if (end < 192) {
            return bitcount(long_ptr[2] & start_mask & end_mask);
        } else if (end < 256) {
            return bitcount(long_ptr[2] & start_mask) + bitcount(long_ptr[3] & end_mask);
        } else {
            return bitcount(long_ptr[2] & start_mask) + bitcount(long_ptr[3]);
        }
    } else if (begin<256) {
        if (end < 256) {
            return bitcount(long_ptr[3] & start_mask & end_mask);
        } else {
            return bitcount(long_ptr[3] & start_mask);
        }
    } else {
        return 0;
    }
}

I am finding that performance of this code is quite good, but I am wondering if there is anything I can do to make it faster, or if a redesign of the algorithm could result in a performance boost.

[Fastest way to count bits](http://stackoverflow.com/q/14009765/995714) — phuclv, Dec 07 '15 at 02:15
I think your problem is how to count the bits set within a certain range of bits across multiple integers, rather than within simply 1 integer. Please make that very clear or people will skim your question and make assumptions about it. — Neil Kirk, Dec 07 '15 at 02:16
Since you seem to be asking for review of existing, working code; codereview.stackexchange.com would be a more appropriate site to post on — M.M, Dec 07 '15 at 02:27
It seems like your algorithm only works for 64-bit ints, so perhaps use `uint64_t` instead of `size_t` — M.M, Dec 07 '15 at 02:28
@M.M Little known fact: you can use `[SiteName.SE]` to auto-box a site name and link. ([CodeReview.SE]) :) — Der Kommissar, Dec 07 '15 at 02:30
good point... on my hardware size_t is uint64, but perhaps I should make that explicit — markt1964, Dec 07 '15 at 02:59
I believe size_t is an unsigned type, so you don't need to check if begin >= 0. Although an assert that `begin <= end` might be appropriate. Also, I think your elseifs do redundant checking of begin. You will never hit `else if (begin >= 64 && begin < 128)` if `begin < 64`. An assert that `end < 256` might be good too. — David Wohlferd, Dec 07 '15 at 03:40
I haven't run this, but if you have a test harness, what about something like: `assert(begin <= end && end < 256); uint32_t tb = begin / 64; uint32_t te = end / 64; uint32_t ret = 0; uint64_t r; for (r = long_ptr[tb] & start_mask; tb < te ; tb++, r = long_ptr[tb]) { ret += bitcount(r); } return ret + bitcount(r & end_mask);`? While OP code avoids the loop, I believe this has no more compares/jumps. And being smaller, may not pollute the code cache as much. Might be worth benchmarking, anyway. — David Wohlferd, Dec 07 '15 at 04:48
For words where you need to count only some of the set bits, shifting the input rather than generating a variable mask is a good way to get rid of the bits you don't want to count. You can probably adapt [my answer on a question about counting bits below a thresholdd in a 64bit bitset](http://stackoverflow.com/questions/34407437/what-is-the-efficient-way-to-count-set-bits-at-a-position-or-lower/34410357#34410357). It's nowhere near optimal directly for inputs larger than 64bits, but the general idea should be useful. Note that it also has a test that you don't need. — Peter Cordes, Mar 06 '16 at 03:17
If you're on x86, where unaligned pointers are fast, it may be worth starting at the right byte, to avoid a case where you count 7b of one element and 11b of the next element, instead of just doing one mask & count on an unaligned load. — Peter Cordes, Mar 06 '16 at 03:21

VladimirS · Answer 1 · 2016-03-05T20:46:52.033

2

I have created 2 different versions with zero branching and I believe David Wohlferd comment should be selected due compactness. I do not believe that none branching version would be any faster. Processor branch prediction may eliminate effectively jumps on consistent data. While none branching one will count bits 4 times all the time(unless SSE?). I am publishing here my second(very short) none branch version. First was complicated with bit computations.

unsigned bitcount2(const uint64_t *long_ptr, size_t begin, size_t end)
{
    uint64_t mask[] = { 0, 0, 0, ~((1ULL << (begin & 63)) - 1), -1LL, -1LL, -1LL, ((1ULL << (end & 63)) - 1), 0, 0, 0 };
    uint64_t* b_start = mask+(3 - begin / 64);
    uint64_t* b_end = mask + (7 - end / 64);
    return bitcount(long_ptr[0] & b_start[0] & b_end[0]) +
        bitcount(long_ptr[1] & b_start[1] & b_end[1]) +
        bitcount(long_ptr[2] & b_start[2] & b_end[2]) +
        bitcount(long_ptr[3] & b_start[3] & b_end[3]);
}

edited Mar 05 '16 at 20:46

answered Mar 05 '16 at 20:41

VladimirS

600
6
14

The `~((1ULL << (begin & 63)) - 1)` can't cross boundaries into a different array element. I don't think this works right when begin >= 64 or when end is outside a limited range. Perhaps you're trying to do address math to get an unaligned window into `mask`? That's not what your code does, because you're never casting `mask` to a `uint8_t`. – Peter Cordes Mar 06 '16 at 03:13
@Peter Cordes I did not cover boundary conditions when begin or end above 256. Obviously replacing size_t with uchar would be sufficient. Within 0-256 should work. I did not finish the boundary conditions because performance testing on my old windows box shows that David Wohlferd version is faster for random cases I build. – VladimirS Mar 06 '16 at 03:52
Oh, I think I see how it works now. I think I missed the fact that you're indexing `b_end` after setting it based on `end`. If the same `start` and `end` are used often, or with a pattern, then a branchy version will probably do better than this. Otherwise it looks fairly good. – Peter Cordes Mar 06 '16 at 03:58
SSSE3 pshufb or AVX2 vpshufb might do well, but getting the masking right is tricky. – Peter Cordes Mar 06 '16 at 04:01
I have tested on approximately 10 cases various regions and validated that there are no branching. I used https://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable so my performance measurement could be off. – VladimirS Mar 06 '16 at 04:03

counting bits across multiple integers... is there a faster way?

1 Answers1