If I have very large array of bytes and want to find indices of all 1-bits, indices counting from leftmost bit, how do I do this efficiently, probably using SIMD.
(For finding the first 1-bit, see an earlier question. This question produces an array of outputs instead of 1 index.)
Of course I can do following 512-bit non-SIMD version using C++20:
Try it online!
#include <cstdint>
#include <iostream>
#include <bit>
int Find1s512(uint64_t const * p, uint16_t * idxs) {
int rpos = 0;
for (int i = 0; i < 8; ++i) {
uint64_t n = p[i];
while (true) {
int const j = std::countr_zero(n);
if (j >= 64)
break;
idxs[rpos++] = i * 64 + j;
n &= n - 1;
}
}
return rpos;
}
int main() {
uint64_t a[8] = {(1ULL << 17) | (1ULL << 63),
1ULL << 19, 1ULL << 23};
uint16_t b[512] = {};
int const cnt = Find1s512(a, b);
for (int i = 0; i < cnt; ++i)
std::cout << int(b[i]) << " ";
// Printed result: 17 63 83 151
}
And use above 512-bit version as building block to collect 1-bit positions of whole large array.
But I'd like to find out what is the most efficient way to do this, especially using SIMD 128/256/512.