
I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get the matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions?

  • I should add, simd isn't necessary, in general I am just looking for the fastest approach. Perhaps some bit magic? – Jimbo Dec 01 '16 at 16:10
    Your basic idea is sound - I have a feeling that there may already be a SIMD implementation much as you describe in a previous question on StackOverflow, but a quick search didn't turn it up. Note that what you're implementing is effectively `strchr` (or `memchr` if you know the length), and there may well already be SIMD-optimised implementations of this available. Note also that for strings that are not already in cache your function may well be memory bandwidth limited. – Paul R Dec 01 '16 at 16:25
    [Here's an SSE implementation which scans a string for a `'\0'`](http://stackoverflow.com/a/14524319/253056) (effectively `strlen`), which you might be able to adapt. – Paul R Dec 01 '16 at 16:28
  • Relevant post: https://stackoverflow.com/questions/47245773/why-is-strchr-twice-as-fast-as-my-simd-code – Jimbo Nov 12 '17 at 05:33

1 Answer


You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that's the optimal way to implement this, at least for Intel CPUs. VPMOVMSKB r32, ymm is the same speed as the 16-byte XMM version, so it would be a huge loss to unpack the two lanes of a 256b vector, movemask them separately, and then recombine the integer results. (Source: Agner Fog's instruction tables. See other perf links in the tag wiki.)

Make the code inside the loop as efficient as possible by leaving the ffs until after you've identified a non-zero result from _mm256_movemask_epi8.

TEST/JCC can macro fuse into a single uop, but BSF/JCC doesn't, so it takes an extra instruction. (And you'd be hard-pressed to get a C compiler to emit BSF/JCC anyway. More likely branching on the result of ffs would give you some kind of test for the input being non-zero, then BSF, then add 1, then compare-and-branch. That's obviously horrible compared to just testing the movemask result.)

(Update: in C++20, use std::countr_zero. It can compile to a single tzcnt, instead of having the off-by-one of ffs. Since you've already checked for the mask being non-zero, the compiler can hopefully optimize it to a single (rep) bsf instruction if it isn't sure that all CPUs running the code will support tzcnt. If you can assume BMI1 in your target CPUs, which you usually can for AVX2 code, then enable that so you'll reliably get an efficient tzcnt.)

Also note that for similar problems, comparing the movemask (e.g. to check that it's 0xFFFFFFFF) is just as efficient as branching on it being non-zero.
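
A minimal C++ sketch of that loop structure (not from the original post: the function name, the multiple-of-32 length, and skipping the tail handling are my assumptions). The loop branches only on the movemask being non-zero; the bit-scan runs once, on the hit, via C++20 std::countr_zero:

```cpp
#include <immintrin.h>
#include <bit>        // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>

// Sketch: the loop condition only does movemask + test-for-zero
// (TEST/JCC can macro-fuse); the bit-scan happens once, on the hit.
// Assumes len is a multiple of 32; the tail is not handled here.
std::ptrdiff_t find_byte_avx2(const char* buf, std::size_t len, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    for (std::size_t i = 0; i < len; i += 32) {
        __m256i chunk = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(buf + i));
        std::uint32_t mask = static_cast<std::uint32_t>(
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle)));
        if (mask != 0)   // cheap test; this can compile to test/jcc
            return static_cast<std::ptrdiff_t>(i + std::countr_zero(mask));
    }
    return -1;  // not found
}
```

Built with BMI1 available (e.g. -march=haswell), std::countr_zero on the hit path should compile to a single tzcnt; and checking mask == 0xFFFFFFFF instead of mask != 0 would be an equally cheap loop condition for the all-match kind of problem mentioned above.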


As Paul R suggested, looking at some strlen, strchr, and memchr implementations may be informative. There are multiple hand-written asm versions in open-source C libraries and other places (e.g. glibc, and Agner Fog's asmlib).

Many of glibc's versions scan up to an alignment boundary, then use an unrolled loop that reads 64B at a time (in 4 SSE vectors, since I don't think glibc has an AVX2 version).

To optimize for long strings, reduce overhead from testing the compare results by ORing the compare results together, and check that. If you find a hit, go back and re-test your vectors to see which vector had the hit.

It may be somewhat more efficient to do the ffs on one 64-bit integer that you built up out of multiple movemask results (with shift and |). I'm not sure about doing this inside the loop before testing for zero; I don't remember if one of glibc's strlen strategies did that or not.
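
A sketch of that unrolled idea, translated to AVX2 (two 256-bit vectors per 64-byte step instead of glibc's four SSE vectors; the function name and the aligned, multiple-of-64 length assumptions are mine). The compare results are ORed so there is one movemask and one branch per 64 bytes, and the combined 64-bit mask is only built after a hit:

```cpp
#include <immintrin.h>
#include <bit>        // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>

// Sketch: test 64 bytes per iteration with one branch by ORing the
// two compare results.  Only after a hit do we redo the movemasks,
// combine them into one 64-bit integer, and bit-scan once.
// Assumes buf is 32-byte aligned and len is a multiple of 64.
std::ptrdiff_t find_byte_avx2_unrolled(const char* buf, std::size_t len, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    for (std::size_t i = 0; i < len; i += 64) {
        __m256i v0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(buf + i));
        __m256i v1 = _mm256_load_si256(reinterpret_cast<const __m256i*>(buf + i + 32));
        __m256i e0 = _mm256_cmpeq_epi8(v0, needle);
        __m256i e1 = _mm256_cmpeq_epi8(v1, needle);
        // One movemask + one branch per 64 bytes: OR the results first.
        if (_mm256_movemask_epi8(_mm256_or_si256(e0, e1)) != 0) {
            // Hit somewhere in these 64 bytes: build the 64-bit mask
            // from the two 32-bit movemasks and bit-scan once.
            std::uint64_t m0 = static_cast<std::uint32_t>(_mm256_movemask_epi8(e0));
            std::uint64_t m1 = static_cast<std::uint32_t>(_mm256_movemask_epi8(e1));
            std::uint64_t mask = m0 | (m1 << 32);
            return static_cast<std::ptrdiff_t>(i + std::countr_zero(mask));
        }
    }
    return -1;  // not found
}
```

A real version would add an unaligned or scalar prologue up to an alignment boundary and a tail loop, as the glibc versions do.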


Everything I've suggested here is stuff that can be seen in asm in various glibc strategies for strlen, memchr, and related functions. Here's sysdeps/x86_64/strlen.S, but there may be another source file somewhere using more than baseline SSE2. (Or not; I might be thinking of a different function, and maybe there's nothing to be gained beyond SSE2 until AVX (3-operand insns) and AVX2 (256b integer vectors).)

See also:


glibc's memchr uses PMAXUB instead of POR. I'm not sure if that's useful for some arcane microarchitectural reason, but it runs on fewer ports on most CPUs. Perhaps that's desired, to avoid resource conflicts with something else? IDK, seems weird, since it competes with PCMPEQB.
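
(A tiny illustration of why either instruction works here, not glibc's actual code: on pcmpeqb results every byte is 0x00 or 0xFF, so unsigned byte max and OR produce identical output, and the choice only affects which instructions/ports get used.)

```cpp
#include <immintrin.h>

// e0/e1 are vpcmpeqb results: every byte is 0x00 or 0xFF, so these
// two produce identical results; they just compile to different
// instructions (VPOR vs VPMAXUB).
__m256i combine_or (__m256i e0, __m256i e1) { return _mm256_or_si256(e0, e1); }
__m256i combine_max(__m256i e0, __m256i e1) { return _mm256_max_epu8(e0, e1); }
```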

Peter Cordes
  • The thought behind _mm_movemask_epi8 was that it looks like it is faster on newer processors than _mm256_movemask_epi8, even if it needs to be called twice. If it doesn't, then you get a savings on avoiding the extra call. This of course seems to be processor dependent, so on Haswell where they have equal latencies, the larger call (i.e. _mm256_movemask_epi8) seems to be a better approach. – Jimbo Dec 01 '16 at 21:44
  • @Jimbo: oh hmm, I hadn't noticed that `PMOVMSKB r, v` in Agner Fog's table for Skylake is listed as 2-3c latency. On Haswell, `VMOVMSKPS/D r32, ymm` is 2c latency, but the xmm version is 3c latency! That's surprising. Where are you seeing that the 256b version is slower? Are you sure the ymm version isn't faster on Skylake? – Peter Cordes Dec 02 '16 at 00:43
  • @Jimbo: Anyway, the difference is at most one cycle of latency and no extra uops or throughput. **`_mm256_movemask_epi8` is still the best you can do**. Nothing you could do with the two halves separately can possibly be as good as just using one VPMOVMSKB r32, ymm. Using a 128b movmsk on the upper lane would require extracting it first to the low 128b of a register, with a 3-cycle latency lane-crossing shuffle like VEXTRACTF128. – Peter Cordes Dec 02 '16 at 00:48
  • Anyway, keep in mind that testing the mask for a loop condition is only sensitive to latency for detecting mispredicts, and for feeding the mask to BSF or TZCNT (`ffs`) after the last iteration. Speculative execution with branch prediction means that every conditional-branch instruction is a separate dependency chain. I.e. control dependencies are not data dependencies. Shorter latency on the flag input to a JCC doesn't affect throughput, only the latency before a branch mispredict can be detected. – Peter Cordes Dec 02 '16 at 00:51