why is strchr twice as fast as my simd code

Question

I am learning SIMD and was curious to see whether it was possible to beat strchr at finding a character. It appears that strchr uses the same intrinsics but I assume that it checks for a null, whereas I know the character is in the array and plan on avoiding a null check.

My code is:

size_t N = 1e9;
bool found = false; //Not really used ...
size_t char_index1 = 0;
size_t char_index2 = 0;
char * str = malloc(N);
memset(str,'a',N);

__m256i char_match;
__m256i str_simd;
__m256i result;
__m256i* pSrc1;

int simd_mask;

str[(size_t)5e8] = 'b';


    char_match = _mm256_set1_epi8('b');
    result = _mm256_set1_epi32(0);

    simd_mask = 0;

    pSrc1 = (__m256i *)str;

    while (1){
        str_simd  = _mm256_lddqu_si256(pSrc1);
        result = _mm256_cmpeq_epi8(str_simd, char_match);
        simd_mask = _mm256_movemask_epi8(result);   
        if (simd_mask != 0){
            break;
        }
        pSrc1++;
    }

Full (not yet finished code) at: https://gist.github.com/JimHokanson/433e185ba53b41e49ce3ac804568ac1e

strchr is twice as fast as this code (using gcc and xcode). I'm hoping to understand why.

Update: compiling using: gcc -std=c11 -mavx2 -mlzcnt

Related post: https://stackoverflow.com/questions/40915243/find-the-first-instance-of-a-character-using-simd — Jimbo, Nov 12 '17 at 05:33
On most systems it's possible to get the source for the standard functions like `strchr`, or at the very least you could always examine the generated machine code for it. I suggest you study the source (or machine code) to see what it does. My guess is that over the 40 or so years the function have existed that it has been tuned and optimized quite a bit. — Some programmer dude, Nov 12 '17 at 05:34
@Brian I had completely forgot to set any optimization flag! Setting -O3 I'm now at roughly 75% of strchr execution time as opposed to 200%. It would still be interesting to know why. — Jimbo, Nov 12 '17 at 05:37
@Someprogrammerdude I'm not exactly sure how to study the machine code. Importantly though it is generally possible to beat heavily optimized code if you are making simplifying assumptions that the reference code is not making. — Jimbo, Nov 12 '17 at 05:39
See https://stackoverflow.com/questions/38552116/how-to-remove-noise-from-gcc-clang-assembly-output. For more about looking at compiler output. Especially the link to Matt Godbolt's CppCon2017 talk on youtube about reading compiler output (and x86 asm in general). — Peter Cordes, Nov 12 '17 at 05:50
It appears you are also assuming that reading 32 bytes from the start of your string will never fault. That's not true for short strings near the end of a page. And BTW, gcc doesn't unroll loops by default, so I wouldn't count on that loop being optimal if your data is hot in L1D cache (so memory isn't a bottleneck.) BTW, `_mm256_lddqu_si256` has zero advantage over `_mm256_loadu_si256`. Hopefully they both compile to `vmovdqu`. — Peter Cordes, Nov 12 '17 at 05:52
@PeterCordes Thanks for the info. FWIW this isn't a final version of the code. I'll update my answer accordingly. — Jimbo, Nov 12 '17 at 06:02
[This is glibc's `strchr-avx2.S`](https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strchr-avx2.S.html). Yes, it has to do about 2x as much work checking for nul as well as the character. But notice how they unroll by 4 vectors and OR the compare results together to save on `vpmovmskb` / branch throughput. This is [`memchr-avx2.S`](https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memchr-avx2.S.html) where they don't check for nul. Pretty nice asm, unaligned startup and aligned inner loop. — Peter Cordes, Nov 12 '17 at 06:04

Jimbo · Answer 1 · 2017-11-12T06:09:50.607

I had not set an optimization flag in the compiler. Setting -O3 resulted in the SIMD code only taking 75% of the time of strchr.

Update: I should also clarify this is not a final working version of the code. There are still additional checks that need to be put in place and possible ways of optimizing the calls (I think). At least at this point though the code is in the ballpark of strchr. As pointed out in the question comments this version could read past a page and fault. Finally, this is mostly a SIMD learning opportunity (for myself), and memchr is probably your best bet (although I suspect you might be able to just slightly beat memchr if you have a sentinel buffer).

why is strchr twice as fast as my simd code

1 Answers1

Linked