
Input is a bitarray stored in contiguous memory with 1 bit of the bitarray per 1 bit of memory.

Output is an array of the indices of set bits of the bitarray.

Example:

bitarray: 0000 1111 0101 1010
setA: {4,5,6,7,9,11,12,14}
setB: {2,4,5,7,9,10,11,12}

Getting either set A or set B is fine. The set is stored as an array of uint32_t, so each element of the set is an unsigned 32-bit integer in the array.

How can I do this about 5 times faster on a single CPU core?

current code:

#include <iostream>
#include <vector>
#include <cstdint>
#include <time.h>

using namespace std;

// forward declaration so the call inside the template below compiles
inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base);

template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    for(i = 0; i < size; i++){
        find_set_bit(v[i], ptr_set_new, base);
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    int k = base;
    while(n){
        if (n & 1){
            *(ptr_set) = k;
            ptr_set++;
        }
        n = n >> 1;
        k++;
    }
}

template <typename T>
void rand_vector(T& v){
    srand(time(NULL));
    int i;
    int size = v.capacity();
    for (i=0;i<size;i++){
        v[i] = rand();
    }
}

template <typename T>
void print_vector(T& v, int size_in = 0){
    int i;

    int size;
    if (size_in == 0){
        size = v.capacity();
    } else {
        size = size_in;
    }
    for (i=0;i<size;i++){
        cout << v[i] << ' ';
    }
    cout << endl;
}

int main(void){
    const int test_size = 6000;
    vector<uint32_t> vec(test_size);
    vector<uint32_t> set(test_size*sizeof(uint32_t)*8);
    rand_vector(vec);
    //for (int i; i < 64; i++) vec[i] = -1;
    //cout << "input" << endl;
    print_vector(vec);
    //cout << "calculate result" << endl;

    int i;
    int rep = 10000;
    uint32_t res_size;

    struct timespec tp_start, tp_end;
    clock_gettime(CLOCK_MONOTONIC, &tp_start);
    for (i=0;i<rep;i++){
        res_size = bitarray2set(vec, set.data());
    }
    clock_gettime(CLOCK_MONOTONIC, &tp_end);
    double timing;
    const double nano = 0.000000001;

    timing = ((double)(tp_end.tv_sec  - tp_start.tv_sec )
           + (tp_end.tv_nsec - tp_start.tv_nsec) * nano) /(rep);

    cout << "timing per cycle: " << timing << endl;
    cout << "print result" << endl;
    //print_vector(set, res_size);
}

result (compiled with icc -O3 code.cpp -lrt)

...
timing per cycle: 0.000739613 (7.4E-4).
print result

0.0008 seconds to convert 768,000 bits to a set. But there are at least 10,000 arrays of 768,000 bits in each cycle, so that is 8 seconds per cycle. That is slow.

The CPU has the POPCNT instruction and the SSE4.2 instruction set.

Thanks.

Update


template <typename T>
uint32_t bitarray2set(T& v, uint32_t * ptr_set){
    uint32_t i;
    uint32_t base = 0;
    uint32_t * ptr_set_new = ptr_set;
    uint32_t size = v.capacity();
    uint32_t * ptr_v;
    uint32_t * ptr_v_end = v.data() + size;  // not &v[size]: that accesses one past the end
    for(ptr_v = v.data(); ptr_v < ptr_v_end; ++ptr_v){
        while(*ptr_v) {
           *ptr_set_new++ = base + __builtin_ctz(*ptr_v);
           (*ptr_v) &= (*ptr_v) - 1;  // zeros the lowest 1-bit in n
        }
        base += 8*sizeof(uint32_t);
    }
    return (ptr_set_new - ptr_set);
}

This updated version uses the inner loop provided by rhashimoto. I don't know whether the inlining actually made the function slower (I never thought that could happen!). The new timing is 1.14E-5 (compiled with icc -O3 code.cpp -lrt, and benchmarked on a random vector).

Warning:

I just found that reserving instead of resizing a std::vector and then writing directly to the vector's data through raw pointers is a bad idea. Resizing first and then using raw pointers is fine, though. See Robᵩ's answer at Resizing a C++ std::vector<char> without initializing data. I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by value-initializing each element of the vector... at least vectors are guaranteed to use contiguous memory, like a plain array (Are std::vector elements guaranteed to be contiguous?).
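A minimal sketch of the safe pattern described in the warning above (the `demo_resize_then_raw` name is my own, for illustration):

```cpp
#include <cstdint>
#include <vector>

// Illustration: write through data() only after resize(), not after reserve().
uint32_t demo_resize_then_raw() {
    std::vector<uint32_t> set;
    // set.reserve(1024);  // capacity only -- size() stays 0, and writing
    //                     // through set.data() here would be undefined
    set.resize(1024);      // elements now exist (value-initialized to 0)
    uint32_t *p = set.data();
    p[0] = 42;             // well-defined: element 0 exists
    return set[0];
}
```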

rxu
  • Would you be able to trade off space for time? Table lookup could do it. Consider 8 tables - one for each of the eight 4-bit fields of the 32-bit word - of 16 entries each, where each entry is an array (or vector) of the indices for that 4-bit pattern of the uint32. – davidbak Jul 12 '16 at 22:05
  • I think so. That is a good idea. How would I use that table? Do you mean to do a bitwise & between the 32-bit word from the input and the keys of the tables? – rxu Jul 12 '16 at 22:10
  • 1
    for each of 8 4-bit fields: grab the field, use it as an index into an 8-element array of a 16-element array of variable length vectors, append all items in that vector to your set of indexes being accumulated. Or you could use 1 256-entry vector for all 4 bytes, but you'd have to add 0/8/16/24 to the indexes as you added them to your result vector. Given your hint that you can use SSE4 I'm hoping someone answers with a cool SIMD approach. Why not add [sse] as a tag? – davidbak Jul 12 '16 at 22:16
  • 1
    @davidbak: I don't think SSE4 helps (just SSE2 for vector integer add and shift), but popcnt does to solve the variable-length store problem. – Peter Cordes Jul 14 '16 at 06:03
  • re:resizing: reserve space and use `emplace_back` to construct new elements in-place as they are created. – Peter Cordes Jul 14 '16 at 18:59
  • @davidbak: oops, just add, not shift. I was generating a set of 2^pos, not just pos. Updated my answer with a bugfix. – Peter Cordes Jul 14 '16 at 19:20
  • @Peter Cordes: thanks for the suggestion. I haven't used emplace_back before and want to know more about it. If I reserve enough space, would emplace_back never cause any reallocation of memory? Is emplace_back as fast as using a pointer to directly read/write the elements? Reallocation is not ok because I am using the vector with OpenMP in other inline functions, and might switch to pthreads later. – rxu Jul 14 '16 at 19:26
  • It's a lot like push_back, but constructs in place instead of copying. If you reserve space ahead of time, the vector won't need to reallocate. – Peter Cordes Jul 14 '16 at 19:54
  • I see. I was warned about push_back (http://stackoverflow.com/questions/3664272/is-stdvector-so-much-slower-than-plain-arrays). I guess I will use emplace_back for loading data from the hard disk, after getting the file size and reserving space in memory, and then use pointers to do the bitarray2set and other calculations. – rxu Jul 14 '16 at 20:06
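A small sketch of the reserve() + emplace_back() pattern discussed in these comments (the demo function name is my own, for illustration):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// reserve() + emplace_back(): one up-front allocation, then construct in place.
// Returns true if no reallocation happened while filling the vector.
bool demo_reserve_emplace(std::size_t n) {
    std::vector<uint32_t> v;
    v.reserve(n);                       // capacity >= n, but size() == 0
    const uint32_t *before = v.data();
    for (uint32_t i = 0; i < n; ++i)
        v.emplace_back(i * 10);         // grows size() without reallocating
    // data pointer unchanged => capacity was never exceeded
    return v.data() == before && v.back() == (n - 1) * 10;
}
```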

2 Answers


I notice that you use .capacity() when you probably mean to use .size(). That could make you do extra unnecessary work, as well as giving you the wrong answer.

Your loop in find_set_bit() iterates over all 32 bits in the word. You can instead iterate only over the set bits, using the BSF instruction to determine the index of the lowest one. GCC has an intrinsic function, __builtin_ctz(), that generates BSF or its equivalent - I think the Intel compiler also supports it (you can use inline assembly if not). The modified function would look like this:

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Find the set bits in a uint32_t
    while(n) {
       *ptr_set++ = base + __builtin_ctz(n);
       n &= n - 1;  // zeros the lowest 1-bit in n
    }
}

On my Linux machine, compiling with g++ -O3, replacing that function drops the reported time from 0.000531434 to 0.000101352.

There are quite a few ways to find a bit index in the answers to this question. I do think that __builtin_ctz() is going to be the best choice for you, though. I don't believe that there is a reasonable SIMD approach to your problem, as each input word produces a variable amount of output.
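For example, decoding the question's example word - 0x5AF0, which has bits 4, 5, 6, 7, 9, 11, 12, 14 set - reproduces setA (the `demo` wrapper is just for illustration):

```cpp
#include <cstdint>

inline void find_set_bit(uint32_t n, uint32_t*& ptr_set, uint32_t base){
    // Emit the index of each set bit, lowest first
    while(n) {
       *ptr_set++ = base + __builtin_ctz(n);
       n &= n - 1;  // zeros the lowest 1-bit in n
    }
}

// Decode 0x5AF0 and return how many indices were written to out[]
uint32_t demo(uint32_t *out) {
    uint32_t *p = out;
    find_set_bit(0x5AF0u, p, 0);
    return (uint32_t)(p - out);
}
```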

rhashimoto
  • Thank you for this answer. It helped a lot. The program becomes 3x faster on the machine, after using your inner loop. I also found the outer loop is kind of slow, and improved that in the updated version. – rxu Jul 13 '16 at 15:04
  • Variable-size output per chunk isn't a problem for SIMD, with the help of popcnt. See my answer. – Peter Cordes Jul 14 '16 at 06:02
  • I just found that reserving instead of resizing a std::vector and then writing directly to the vector's data through raw pointers is a bad idea. Resizing first and then using raw pointers is fine, though. See Robᵩ's answer at http://stackoverflow.com/questions/7689406/resizing-a-c-stdvectorchar-without-initializing-data I am going to just use resize instead of reserve and stop worrying about the time that resize wastes by constructing each member. – rxu Jul 14 '16 at 16:20

As suggested by @davidbak, you could use a table lookup to process 4 elements of the bitmap at once.

Each lookup produces a variable-sized chunk of set members, which we can handle by using popcnt.

@rhashimoto's scalar ctz-based suggestion will probably do better with sparse bitsets that have lots of zeros, but this should be better when there are a lot of set bits.

I'm thinking something like

// a vector of 4 elements for every pattern of 4 bits.
// values range from 0 to 3, and will have a multiple of 4 added to them.
alignas(16) static const int LUT[16*4] = { 0,0,0,0,  ... };

// mostly C, some pseudocode.
unsigned int bitmap2set(int *set, int input) {
    int *set_start = set;

    __m128i offset = _mm_setzero_si128();

    for (nibble in input[]) {  // pseudocode for the actual shifting / masking
        __m128i v = _mm_load_si128((const __m128i*)&LUT[nibble*4]);
        __m128i vpos = _mm_add_epi32(v, offset);

        _mm_storeu_si128((__m128i*)set, vpos);  // unaligned: set advances by popcount

        set += _mm_popcnt_u32(nibble);    // variable-length store
        offset = _mm_add_epi32(offset, _mm_set1_epi32(4));  // increment the offset by 4
    }
    return  set - set_start;  // set size
}

When a nibble isn't 1111, the next store will overlap, but that's fine.

Using popcnt to figure out how much to increment a pointer is a useful technique in general for left-packing variable-length data into a destination array.
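A filled-in, compilable sketch of this idea - the LUT is written out, and the 128-bit store is replaced by a portable fixed 4-element copy (an actual SSE2 version would use _mm_load_si128/_mm_storeu_si128 and _mm_popcnt_u32); the function name and layout here are illustrative:

```cpp
#include <cstddef>
#include <cstdint>

// For each 4-bit pattern, the positions of its set bits, padded to 4 lanes.
// Padding lanes hold 0; they are overwritten by the next (overlapping) store.
static const uint32_t LUT[16][4] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3},
};

// `set` needs 3 slots of slack past the maximum output size, because every
// store writes all 4 lanes regardless of how many are valid.
uint32_t bitarray2set_lut(const uint32_t *in, size_t n_words, uint32_t *set) {
    uint32_t *out = set;
    uint32_t base = 0;
    for (size_t i = 0; i < n_words; ++i) {
        uint32_t w = in[i];
        for (int shift = 0; shift < 32; shift += 4) {
            uint32_t nib = (w >> shift) & 0xF;
            // Unconditionally write 4 lanes (stand-in for _mm_storeu_si128)...
            for (int j = 0; j < 4; ++j)
                out[j] = LUT[nib][j] + base;
            // ...then keep only the valid ones by advancing by the popcount.
            out += __builtin_popcount(nib);
            base += 4;
        }
    }
    return (uint32_t)(out - set);
}
```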

Peter Cordes
  • Thank you for this answer. I will try this one after finishing writing other parts of the program. – rxu Jul 15 '16 at 15:02