0

If I have array of 16 or 32 or 64 bytes (let's suppose aligned on 64-bytes memory boundary), how do I quickly find index of first byte equal to given, using SIMD SSE2/AVX/AVX2/AVX-512. If such byte not exist for example you can return index equal to size of array.

uint8_t a[16];
size_t index = FindByte(a, 0x37);

of course without SIMD it can be implemented with simple loop

Try it online!

template <size_t Size>
size_t FindByte(uint8_t const (&arr)[Size], uint8_t b) {
    for (size_t i = 0; i < Size; ++i)
        if (arr[i] == b)
            return i;
    return Size;
}

These above are just toy examples. In real application I have to check not a single array but millions of arrays per second. That's why I need to do this very fast, hence I was thinking of doing this using SIMD instructions. Also multi-threading here will not solve the problem because I have sequential queries to FindByte(), I can't run them in parallel.

I tried to implement following solution with SIMD SSE2 and two precomputed tables:

Try it online!

#include <cstdint>
#include <iostream>
#include <array>
#include <emmintrin.h>

template <size_t Size>
size_t FindByte(uint8_t const (&arr)[Size], uint8_t b) {
    static auto const precompute = []{
        std::array<__m128i, 256> val{};
        for (size_t i = 0; i < 256; ++i)
            val[i] = _mm_set1_epi8(i);
        std::array<uint8_t, 1 << 16> pos{};
        for (size_t i = 0; i < (1 << 16); ++i) {
            size_t j = 0;
            for (j = 0; j < 16; ++j)
                if (i & (1 << j))
                    break;
            pos[i] = j;
        }
        return std::make_pair(std::move(val), std::move(pos));
    }();
    auto const cmp = _mm_cmpeq_epi8(
        _mm_loadu_si128((__m128i*)&arr),
        precompute.first[b]);
    return precompute.second[_mm_movemask_epi8(cmp)];
}

int main() {
    uint8_t arr[16] = {3, 5, 7, 11, 13, 17, 19};
    std::cout << FindByte(arr, 17) << " "
        << FindByte(arr, 23) << std::endl;
}

But this solution above might be non-optimal, because 64 KB table may give L1 cache misses, it only fits into L2 cache.

Also I expect that it can be solved without precomputed tables, just with SIMD instructions alone.

Also I need not only SSE2 solution that I did above, but also AVX/AVX2/AVX512 for the case of 32/64 bytes array.


Update. As question is closed, posting here my final working-fast solution for SSE2 that I figured out with the help of comments of @PeterCordes:

Try it online!

#include <cstdint>
#include <iostream>
#include <bit>

#include <emmintrin.h>

size_t FindByte(uint8_t const * arr16, uint8_t b) {
    return std::countr_zero(uint16_t(
        _mm_movemask_epi8(_mm_cmpeq_epi8(
            _mm_loadu_si128((__m128i*)arr16), _mm_set1_epi8(b)))
    ));
}

int main() {
    uint8_t arr[16] = {3, 5, 7, 11, 13, 17, 19};
    std::cout << FindByte(arr, 17) << " "
        << FindByte(arr, 23) << std::endl;
}

Console output:

5 16
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Arty
  • 14,883
  • 6
  • 36
  • 69
  • 2
    Use `std::countr_zero` (aka `bsf` or `tzcnt`) on a movemask result. [Find the first instance of a character using simd](https://stackoverflow.com/q/40915243) links some examples: glibc `memchr` – Peter Cordes Oct 22 '22 at 20:07
  • @PeterCordes Thanks! If CPU doesn't have special instruction to find first 1, how then `std::countr_zero` implemented? Using table of maybe some strange bit twidling math? – Arty Oct 22 '22 at 20:11
  • 2
    x86 *does* have a special instruction for that since 386. And it's fast since P6, so if SSE2 is available, you have decently fast (rep) `bsf`. See the bullet list of x86 history of support for this in the early part of my half-finished answer on [Set the leading zero bits in any size integer C++](https://stackoverflow.com/a/73834121) – Peter Cordes Oct 22 '22 at 20:14
  • @PeterCordes Besides using `std::countr_zero` is it the only optimization that is necessary for my function? Maybe there is some other SIMD solution, faster than mine? Also does my solution extend easily on AVX2/AVX512, or they don't have exactly same instructions? – Arty Oct 22 '22 at 20:14
  • @PeterCordes Also one more question - do I need to precompute table of `_mm_set1_epi8(byte)`? Or it is faster to inline instruction itself? As table is just 256 elements in size, it fits L1 cache. – Arty Oct 22 '22 at 20:16
  • 2
    You definitely shouldn't use a lookup table to broadcast a byte to a vector!! At worst it's a few shuffles; with SSSE3 or later it's `movd` + one instruction. With AVX-512 it's one instruction with a GPR source. – Peter Cordes Oct 22 '22 at 20:21
  • 1
    What you're implementing is just `memchr`; you should get reasonable performance from a compiler / libc for this. For small fixed sizes (especially a small multiple of the vector width), you might be able to gain some by not having the library function branch on the size. But its basic strategy is still good, and involves `pcmpeqb` / `pmovmskb` / `rep bsf` on the non-zero result. See also [Why is this code using strlen heavily 6.5x slower with GCC optimizations enabled?](https://stackoverflow.com/a/55589634) for a simple hand-written asm loop example with benchmarks. – Peter Cordes Oct 22 '22 at 20:30
  • 1
    (glibc `rawmemchr` can omit the check for getting to the end without a match, making it like strlen.) https://godbolt.org/z/5ffP6G3T6 shows that GCC and clang unfortunately don't inline `memchr` even for an aligned pointer with a constant size of 16 or 32 for `-march=skylake` where it should just be 5 instructions: vmovd / vpbroadcastb then vpcmpeqb / vpmovmskb / tzcnt. (All of them single-uop with at most 3 cycle latency.) So could well be worth manually vectorizing. – Peter Cordes Oct 22 '22 at 21:01
  • 1
    @wovano - Normally we don't want people to edit an answer into the question, but I think it was reasonable in this case to simply put the pieces together from the disparate linked duplicates. I rolled back to the version that had the OP's code. Maybe I should even have reopened the question so they could post it as an answer, since none of the duplicates had code precisely like this, using set1 to do `memchr`, and especially using modern C++ `std::countr_zero`, but hopefully any beginners that need this will still be able to find it. – Peter Cordes Oct 30 '22 at 17:42
  • Updated my answer on one of the linked duplicates to mention `std::countr_zero`. – Peter Cordes Oct 30 '22 at 17:49

0 Answers0