4

I'm pondering how to speed up bit testing in the following routine:

#include <bitset>
#include <cstdint>

void histSubtractFromBits(uint64_t* cursor, uint16_t* hist){
    // traverse each bit of the 256-bit-long bitstring by splitting it up into 4 bitsets
    std::bitset<64> a(*cursor);
    std::bitset<64> b(*(cursor+1));
    std::bitset<64> c(*(cursor+2));
    std::bitset<64> d(*(cursor+3));
    for(int bit = 0; bit < 64; bit++){
        hist[bit] -= a.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+64] -= b.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+128] -= c.test(bit);
    }
    for(int bit = 0; bit < 64; bit++){
        hist[bit+192] -= d.test(bit);
    }
}

The actual gcc implementation does a range check on the bit argument, then ANDs with a bitmask. I could do it without the bitsets, with my own bit-shifting / masking, but I'm fairly certain that won't yield any significant speedup (tell me if I'm wrong and why).
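To be concrete, by "my own bit-shifting / masking" I mean something along these lines for each of the four 64-bit words:

uint64_t a = *cursor;
for(int bit = 0; bit < 64; bit++){
    hist[bit] -= (a >> bit) & 1;  // extract bit directly instead of going through std::bitset
}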

I'm not really familiar with x86-64 assembly, but I am aware of a certain bit test instruction, and I am aware that it's theoretically possible to do inline assembly with gcc.

1) Do you think it at all worthwhile to write an inline-assembly analogue for the above code?

2) If yes, then how would I go about doing it, i.e. could you show me some basic starter code / samples to point me in the right direction?

Greg Kramida
  • Making this micro-optimization doesn't make any sense unless *measurements* show it is performance-relevant. I doubt it very much; I'm sure your performance bottlenecks are elsewhere. – vonbrand Apr 28 '14 at 20:41
  • Does this really say `hist[...] -= ...;` not `hist[...] += ...` ? – FrankH. Apr 29 '14 at 12:16
  • @FrankH. I have two versions of the function, one has "-=" and is called "histSubtractFromBits", the other has "+=" and is called "histAddToBits". Both are important and performance-critical. – Greg Kramida Apr 29 '14 at 14:21
  • @vonbrand, I believe it's relevant, since we're talking about adding / subtracting literally billions of these bitstrings during a single run on a single data item. The other things are already optimized (on the GPU w/ OpenCL). This part, it seems, makes more sense to do on the CPU. – Greg Kramida Apr 29 '14 at 14:22
  • See also http://stackoverflow.com/questions/12985949/methods-to-vectorise-histogram-in-simd and http://stackoverflow.com/questions/10316708/ios-glsl-is-there-a-way-to-create-an-image-histogram-using-a-glsl-shader?lq=1 for some references. I _suspect_ that what you're actually trying to do is to histogram an _array_ / _vector_ of `bitset<256>`, i.e. that `histSubtractFromBits()` is actually called in a loop? The best code for that will look very different from the best code for a _single_ step. – FrankH. Apr 29 '14 at 14:26
  • @FrankH., you're right, it's a histogram of sorts, or rather multiple histograms, one for each region centered at each pixel in the image. Here's the thing, though: it's a sliding histogram of "responses", not of "occurrences." To be more specific, it's an optimized and altered version of the MPEG-7 Color Structure descriptor. Storing the entire output data doesn't seem to be feasible, but storing partial results in bit-strings for each *window* (note, not each "region") seems feasible. Hence I need a very quick way to get the actual descriptors from these. – Greg Kramida Apr 29 '14 at 14:45

2 Answers

6

As far as I can tell, you basically iterate over each bit. As such, I'd imagine simply shifting and masking off the LSB every time should provide good performance. Something like:

uint64_t a = *cursor;
for(int bit = 0; a != 0; bit++, a >>= 1) {
    hist[bit] -= (a & 1);
}

Alternatively, if you expect only very few bits to be set and are happy with gcc-specific stuff, you could use `__builtin_ffsll`:

uint64_t a = *cursor;
int next;
// __builtin_ffsll returns the 1-based index of the least significant set bit, or 0 if no bits are set
for(int bit = 0; (next = __builtin_ffsll(a)) != 0; ) {
    bit += next;
    hist[bit - 1] -= 1;
    a >>= next;
}

The idea should be fine, but no warranty for the actual code :)

Update: code using vector extensions:

typedef short v8hi __attribute__ ((vector_size (16)));  // 8 x 16-bit lanes

// table[byte] holds the 8 bits of 'byte', expanded to one 16-bit lane each
static v8hi table[256];

void histSubtractFromBits(uint64_t* cursor, uint16_t* hist)
{
    uint8_t* cursor_tmp = (uint8_t*)cursor;
    v8hi* hist_tmp = (v8hi*)hist;
    // process the 256-bit string one byte at a time, updating 8 histogram bins per iteration
    for(int i = 0; i < 32; i++, cursor_tmp++, hist_tmp++)
    {
        *hist_tmp -= table[*cursor_tmp];
    }
}

void setup_table()
{
    for(int i = 0; i < 256; i++)
    {
        for(int j = 0; j < 8; j++)
        {
            table[i][j] = (i >> j) & 1;
        }
    }
}

This will be compiled to SSE instructions if available; for example, I get:

        leaq    32(%rdi), %rdx
        .p2align 4,,10
        .p2align 3
.L2:
        movzbl  (%rdi), %eax
        addq    $1, %rdi
        movdqa  (%rsi), %xmm0
        salq    $4, %rax
        psubw   table(%rax), %xmm0
        movdqa  %xmm0, (%rsi)
        addq    $16, %rsi
        cmpq    %rdx, %rdi
        jne     .L2

Of course this approach relies on the table being in cache.
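
For completeness, here is a minimal usage sketch (the `process_all` name is just for illustration). Note that since `hist` is accessed through `v8hi` (hence the `movdqa` above), it should be 16-byte aligned:

// Illustrative only: setup_table() must run once before any histogram calls,
// and hist must be 16-byte aligned because it is loaded/stored with movdqa.
static uint16_t hist[256] __attribute__ ((aligned (16)));

void process_all(uint64_t* bitstrings, int count)  // 'count' 256-bit strings, 4 uint64_t each
{
    setup_table();
    for(int i = 0; i < count; i++)
        histSubtractFromBits(bitstrings + 4*i, hist);
}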

Jester
  • If you expect a lot of bits to be set it may be interesting to try SSE. Code coming up soon ;) – Jester Apr 28 '14 at 18:27
  • Thanks everyone, sorry for the delay, I ran into another issue at my 2nd job, currently working on fixing that. Will get back to you tomorrow! – Greg Kramida Apr 28 '14 at 21:49
  • Time to end my obsessive optimization streak / perf. testing. Compiling with O3 and running on larger inputs produced quite different results from before. Here are the test results: original code - 1x speedup; simple LSB mask - 1.93X speedup; ffs - 2.0X speedup; Thomas Matthews's LSB mask version (unrolling, registers) - 2.25X speedup; vector extension version - **8.54X** speedup. Thanks again for the help. – Greg Kramida Apr 30 '14 at 15:51
2

Another suggestion is to combine data caching, registers and loop unrolling:

#include <climits>   // CHAR_BIT
#include <cstdint>

// Assuming your processor has 64-bit words
void histSubtractFromBits(uint64_t const * cursor, uint16_t* hist)
{
    register uint64_t a = *cursor++;
    register uint64_t b = *cursor++;
    register uint64_t c = *cursor++;
    register uint64_t d = *cursor++;
    register unsigned int i = 0;
    for (i = 0; i < sizeof(*cursor) * CHAR_BIT; ++i)
    {
        hist[i +   0] -= a & 1;
        hist[i +  64] -= b & 1;
        hist[i + 128] -= c & 1;
        hist[i + 192] -= d & 1;
        a >>= 1;
        b >>= 1;
        c >>= 1;
        d >>= 1;
    }
}

I'm not sure if you gain any more performance by reordering the instructions like this:

    hist[i +   0] -= a & 1;
    a >>= 1;

You could try both ways and compare the assembly language for both.
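
For reference, the fully interleaved loop body would look like this (same operations, just reordered):

    hist[i +   0] -= a & 1;
    a >>= 1;
    hist[i +  64] -= b & 1;
    b >>= 1;
    hist[i + 128] -= c & 1;
    c >>= 1;
    hist[i + 192] -= d & 1;
    d >>= 1;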

One of the ideas here is to maximize the register usage. The values to test are loaded into registers and then the testing begins.

Thomas Matthews