Efficiently find indices of 1-bits in large array, using SIMD

Question

If I have very large array of bytes and want to find indices of all 1-bits, indices counting from leftmost bit, how do I do this efficiently, probably using SIMD.

(For finding the first 1-bit, see an earlier question. This question produces an array of outputs instead of 1 index.)

Of course I can do following 512-bit non-SIMD version using C++20:
Try it online!

#include <cstdint>
#include <iostream>
#include <bit>

int Find1s512(uint64_t const * p, uint16_t * idxs) {
    int rpos = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t n = p[i];
        while (true) {
            int const j = std::countr_zero(n);
            if (j >= 64)
                break;
            idxs[rpos++] = i * 64 + j;
            n &= n - 1;
        }
    }
    return rpos;
}

int main() {
    uint64_t a[8] = {(1ULL << 17) | (1ULL << 63),
        1ULL << 19, 1ULL << 23};
    uint16_t b[512] = {};
    int const cnt = Find1s512(a, b);
    for (int i = 0; i < cnt; ++i)
        std::cout << int(b[i]) << " ";
    // Printed result: 17 63 83 151
}

And use above 512-bit version as building block to collect 1-bit positions of whole large array.

But I'd like to find out what is the most efficient way to do this, especially using SIMD 128/256/512.

`uint8_t` has a value range of 0..255. Even one 512-bit chunk has half its bits at higher indices than that. You probably want `uint16_t`. Do you need the bits to be in ascending order? If not, you could get the lowest-set position within word or dword chunks and left-pack (`vpcompressw`) until all SIMD elements became `0`. So you might get an output like `1, 33, 66, 4, 37, 69, ...` if the input happened to have some set bits in each of the first 3 dword elements but none in higher elements. — Peter Cordes, Nov 08 '22 at 06:37
AVX-512 has `vplzcntd` (32-bit elements), which you can use for SIMD tzcnt by isolating the lowest set bit with a 2-operation bithack (like blsi), [as in this](//stackoverflow.com/a/58567658). Or you can use the blsmsk bithack `(SRC-1) XOR (SRC)` to make a mask up to the lowest set bit and `vpopcntb/w/d/q` it, allowing you to work with narrower chunks than dword assuming you have AVX-512BITALG (Ice Lake). That may be better, since you'd already have to separately test-into-mask to reject all-zero elements, so it doesn't matter that they come out the same as elements with their high bit set. — Peter Cordes, Nov 08 '22 at 06:41
See also [Trying to write a vectorized implementation of Gerd Isenberg's Bit Scan Forward as an exercise](https://stackoverflow.com/a/73949263) for SIMD BSF/TZCNT for 16-bit elements with SSSE3 or AVX2 using a De Bruijn sequence and `pshufb`. Without AVX-512 `vplzcntd` or `vpopcntw`, that's a good option, only slightly slower. AVX-512 for left-packing is *very* helpful though; without it you're stuck with shenanigans like [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) — Peter Cordes, Nov 08 '22 at 06:48

Efficiently find indices of 1-bits in large array, using SIMD

0 Answers0