If I have array of 16 or 32 or 64 bytes (let's suppose aligned on 64-bytes memory boundary), how do I quickly find index of first byte equal to given, using SIMD SSE2/AVX/AVX2/AVX-512. If such byte not exist for example you can return index equal to size of array.
uint8_t a[16];
size_t index = FindByte(a, 0x37);
of course without SIMD it can be implemented with simple loop
template <size_t Size>
size_t FindByte(uint8_t const (&arr)[Size], uint8_t b) {
for (size_t i = 0; i < Size; ++i)
if (arr[i] == b)
return i;
return Size;
}
These above are just toy examples. In real application I have to check not a single array but millions of arrays per second. That's why I need to do this very fast, hence I was thinking of doing this using SIMD instructions. Also multi-threading here will not solve the problem because I have sequential queries to FindByte()
, I can't run them in parallel.
I tried to implement following solution with SIMD SSE2 and two precomputed tables:
#include <cstdint>
#include <iostream>
#include <array>
#include <emmintrin.h>
template <size_t Size>
size_t FindByte(uint8_t const (&arr)[Size], uint8_t b) {
static auto const precompute = []{
std::array<__m128i, 256> val{};
for (size_t i = 0; i < 256; ++i)
val[i] = _mm_set1_epi8(i);
std::array<uint8_t, 1 << 16> pos{};
for (size_t i = 0; i < (1 << 16); ++i) {
size_t j = 0;
for (j = 0; j < 16; ++j)
if (i & (1 << j))
break;
pos[i] = j;
}
return std::make_pair(std::move(val), std::move(pos));
}();
auto const cmp = _mm_cmpeq_epi8(
_mm_loadu_si128((__m128i*)&arr),
precompute.first[b]);
return precompute.second[_mm_movemask_epi8(cmp)];
}
int main() {
uint8_t arr[16] = {3, 5, 7, 11, 13, 17, 19};
std::cout << FindByte(arr, 17) << " "
<< FindByte(arr, 23) << std::endl;
}
But this solution above might be non-optimal, because 64 KB table may give L1 cache misses, it only fits into L2 cache.
Also I expect that it can be solved without precomputed tables, just with SIMD instructions alone.
Also I need not only SSE2 solution that I did above, but also AVX/AVX2/AVX512 for the case of 32/64 bytes array.
Update. As question is closed, posting here my final working-fast solution for SSE2 that I figured out with the help of comments of @PeterCordes:
#include <cstdint>
#include <iostream>
#include <bit>
#include <emmintrin.h>
size_t FindByte(uint8_t const * arr16, uint8_t b) {
return std::countr_zero(uint16_t(
_mm_movemask_epi8(_mm_cmpeq_epi8(
_mm_loadu_si128((__m128i*)arr16), _mm_set1_epi8(b)))
));
}
int main() {
uint8_t arr[16] = {3, 5, 7, 11, 13, 17, 19};
std::cout << FindByte(arr, 17) << " "
<< FindByte(arr, 23) << std::endl;
}
Console output:
5 16