
I am trying to find the first instance of a character, in this case '"', using SIMD (AVX2 or earlier). I'd like to use _mm256_cmpeq_epi8, but then I need a quick way of finding whether any of the resulting bytes in the __m256i have been set to 0xFF. The plan was then to use _mm256_movemask_epi8 to convert the result from bytes to bits, and then to use ffs to get the matching index. Is it better to move out a portion at a time using _mm_movemask_epi8? Any other suggestions?

  • I should add, simd isn't necessary, in general I am just looking for the fastest approach. Perhaps some bit magic? – Jimbo Dec 01 '16 at 16:10
    Your basic idea is sound - I have a feeling that there may already be a SIMD implementation much as you describe in a previous question on StackOverflow, but a quick search didn't turn it up. Note that what you're implementing is effectively `strchr` (or `memchr` if you know the length), and there may well already be SIMD-optimised implementations of this available. Note also that for strings that are not already in cache your function may well be memory bandwidth limited. – Paul R Dec 01 '16 at 16:25
    [Here's an SSE implementation which scans a string for a `'\0'`](http://stackoverflow.com/a/14524319/253056) (effectively `strlen`), which you might be able to adapt. – Paul R Dec 01 '16 at 16:28
  • Relevant post: https://stackoverflow.com/questions/47245773/why-is-strchr-twice-as-fast-as-my-simd-code – Jimbo Nov 12 '17 at 05:33

1 Answer


You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that's the optimal way to implement this, at least for Intel CPUs. VPMOVMSKB r32, ymm is the same speed as the 16-byte XMM version, so it would be a huge loss to unpack the two lanes of a 256b vector, movemask them separately, and then recombine the integer results. (Source: Agner Fog's instruction tables. See other perf links in the tag wiki.)

Make the code inside the loop as efficient as possible by leaving the ffs until after you've identified a non-zero result from _mm256_movemask_epi8.

TEST/JCC can macro fuse into a single uop, but BSF/JCC doesn't, so it takes an extra instruction. (And you'd be hard-pressed to get a C compiler to emit BSF/JCC anyway. More likely branching on the result of ffs would give you some kind of test for the input being non-zero, then BSF, then add 1, then compare-and-branch. That's obviously horrible compared to just testing the movemask result.)

(Update: in C++20, use std::countr_zero. It can compile to a single tzcnt, instead of having the off-by-one of ffs. Since you've already checked for the mask being non-zero, the compiler can hopefully optimize it to a single (rep) bsf instruction if it isn't sure that all CPUs running the code will support tzcnt. If you can assume BMI1 in your target CPUs, which you usually can for AVX2 code, then enable that so you'll reliably get an efficient tzcnt.)

Also note that for similar problems, comparing the movemask (e.g. to check that it's 0xFFFFFFFF) is just as efficient as branching on it being non-zero.
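
A minimal C++ sketch of that loop structure (not from the original post: the function name, the multiple-of-32 length, and skipping the tail handling are my assumptions). The loop branches only on the movemask being non-zero; the bit-scan runs once, on the hit, via C++20 std::countr_zero:

```cpp
#include <immintrin.h>
#include <bit>        // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>

// Sketch: the loop condition only does movemask + test-for-zero
// (TEST/JCC can macro-fuse); the bit-scan happens once, on the hit.
// Assumes len is a multiple of 32; the tail is not handled here.
std::ptrdiff_t find_byte_avx2(const char* buf, std::size_t len, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    for (std::size_t i = 0; i < len; i += 32) {
        __m256i chunk = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(buf + i));
        std::uint32_t mask = static_cast<std::uint32_t>(
            _mm256_movemask_epi8(_mm256_cmpeq_epi8(chunk, needle)));
        if (mask != 0)   // cheap test; this can compile to test/jcc
            return static_cast<std::ptrdiff_t>(i + std::countr_zero(mask));
    }
    return -1;  // not found
}
```

Built with BMI1 available (e.g. -march=haswell), std::countr_zero on the hit path should compile to a single tzcnt; and checking mask == 0xFFFFFFFF instead of mask != 0 would be an equally cheap loop condition for the all-match kind of problem mentioned above.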


As Paul R suggested, looking at some strlen, strchr, and memchr implementations may be informative. There are multiple hand-written asm versions in open-source C libraries and other places (e.g. glibc, and Agner Fog's asmlib).

Many of glibc's versions scan up to an alignment boundary, then use an unrolled loop that reads 64B at a time (in 4 SSE vectors, since I don't think glibc has an AVX2 version).

To optimize for long strings, reduce overhead from testing the compare results by ORing the compare results together, and check that. If you find a hit, go back and re-test your vectors to see which vector had the hit.

It may be somewhat more efficient to do the ffs on one 64-bit integer that you built up out of multiple movemask results (with shift and |). I'm not sure about doing this inside the loop before testing for zero; I don't remember if one of glibc's strlen strategies did that or not.
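
A sketch of that unrolled idea, translated to AVX2 (two 256-bit vectors per 64-byte step instead of glibc's four SSE vectors; the function name and the aligned, multiple-of-64 length assumptions are mine). The compare results are ORed so there is one movemask and one branch per 64 bytes, and the combined 64-bit mask is only built after a hit:

```cpp
#include <immintrin.h>
#include <bit>        // std::countr_zero (C++20)
#include <cstddef>
#include <cstdint>

// Sketch: test 64 bytes per iteration with one branch by ORing the
// two compare results.  Only after a hit do we redo the movemasks,
// combine them into one 64-bit integer, and bit-scan once.
// Assumes buf is 32-byte aligned and len is a multiple of 64.
std::ptrdiff_t find_byte_avx2_unrolled(const char* buf, std::size_t len, char c)
{
    const __m256i needle = _mm256_set1_epi8(c);
    for (std::size_t i = 0; i < len; i += 64) {
        __m256i v0 = _mm256_load_si256(reinterpret_cast<const __m256i*>(buf + i));
        __m256i v1 = _mm256_load_si256(reinterpret_cast<const __m256i*>(buf + i + 32));
        __m256i e0 = _mm256_cmpeq_epi8(v0, needle);
        __m256i e1 = _mm256_cmpeq_epi8(v1, needle);
        // One movemask + one branch per 64 bytes: OR the results first.
        if (_mm256_movemask_epi8(_mm256_or_si256(e0, e1)) != 0) {
            // Hit somewhere in these 64 bytes: build the 64-bit mask
            // from the two 32-bit movemasks and bit-scan once.
            std::uint64_t m0 = static_cast<std::uint32_t>(_mm256_movemask_epi8(e0));
            std::uint64_t m1 = static_cast<std::uint32_t>(_mm256_movemask_epi8(e1));
            std::uint64_t mask = m0 | (m1 << 32);
            return static_cast<std::ptrdiff_t>(i + std::countr_zero(mask));
        }
    }
    return -1;  // not found
}
```

A real version would add an unaligned or scalar prologue up to an alignment boundary and a tail loop, as the glibc versions do.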


Everything I've suggested here is stuff that can be seen in asm in various glibc strategies for strlen, memchr, and related functions. Here's sysdeps/x86_64/strlen.S, but there may be another source file somewhere using more than baseline SSE2. (Or not; I might be thinking of a different function, and maybe there's nothing to be gained beyond SSE2 until AVX (3-operand insns) and AVX2 (256b integer vectors).)

See also:


glibc's memchr uses PMAXUB instead of POR. I'm not sure if that's useful for some arcane microarchitectural reason, but it runs on fewer ports on most CPUs. Perhaps that's desired, to avoid resource conflicts with something else? IDK, seems weird, since it competes with PCMPEQB.
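
(A tiny illustration of why either instruction works here, not glibc's actual code: on pcmpeqb results every byte is 0x00 or 0xFF, so unsigned byte max and OR produce identical output, and the choice only affects which instructions/ports get used.)

```cpp
#include <immintrin.h>

// e0/e1 are vpcmpeqb results: every byte is 0x00 or 0xFF, so these
// two produce identical results; they just compile to different
// instructions (VPOR vs VPMAXUB).
__m256i combine_or (__m256i e0, __m256i e1) { return _mm256_or_si256(e0, e1); }
__m256i combine_max(__m256i e0, __m256i e1) { return _mm256_max_epu8(e0, e1); }
```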

Peter Cordes
  • The thought behind _mm_movemask_epi8 was that it looks like it is faster on newer processors than _mm256_movemask_epi8, even if it needs to be called twice. If it doesn't, then you get a savings on avoiding the extra call. This of course seems to be processor dependent, so on Haswell where they have equal latencies, the larger call (i.e. _mm256_movemask_epi8) seems to be a better approach. – Jimbo Dec 01 '16 at 21:44
  • @Jimbo: oh hmm, I hadn't noticed that `PMOVMSKB r, v` in Agner Fog's table for Skylake is listed as 2-3c latency. On Haswell, `VMOVMSKPS/D r32, ymm` is 2c latency, but the xmm version is 3c latency! That's surprising. Where are you seeing that the 256b version is slower? Are you sure the ymm version isn't faster on Skylake? – Peter Cordes Dec 02 '16 at 00:43
  • @Jimbo: Anyway, the difference is at most one cycle of latency and no extra uops or throughput. **`_mm256_movemask_epi8` is still the best you can do**. Nothing you could do with the two halves separately can possibly be as good as just using one VPMOVMSKB r32, ymm. Using a 128b movmsk on the upper lane would require extracting it first to the low 128b of a register, with a 3-cycle latency lane-crossing shuffle like VEXTRACTF128. – Peter Cordes Dec 02 '16 at 00:48
  • Anyway, keep in mind that testing the mask for a loop condition is only sensitive to latency for detecting mispredicts, and for feeding the mask to BSF or TZCNT (`ffs`) after the last iteration. Speculative execution with branch prediction means that every conditional-branch instruction is a separate dependency chain. I.e. control dependencies are not data dependencies. Shorter latency on the flag input to a JCC doesn't affect throughput, only the latency before a branch mispredict can be detected. – Peter Cordes Dec 02 '16 at 00:51