
I have this C:

#include <stddef.h>
size_t findChar(unsigned int length, char*  __attribute__((aligned(16))) restrict string) {
    for (size_t i = 0; i < length; i += 2) {
        if (string[i] == '[' || string[i] == ' ') {
            return i;
        }
    }
    return -1;
}

It checks every other character of a string and returns the first index whose character is '[' or ' '. With x86-64 GCC 10.2 and -O3 -march=skylake -mtune=skylake, this is the assembly output:

findChar:
        mov     edi, edi
        test    rdi, rdi
        je      .L4
        xor     eax, eax
.L3:
        movzx   edx, BYTE PTR [rsi+rax]
        cmp     dl, 91
        je      .L1
        cmp     dl, 32
        je      .L1
        add     rax, 2
        cmp     rax, rdi
        jb      .L3
.L4:
        mov     rax, -1
.L1:
        ret

It seems like it could be optimized significantly, because I see multiple branches. How can I write my C so that the compiler optimizes it with SIMD, string instructions, and/or vectorization?

How do I write my code to signal to the compiler that this code can be optimized?

Interactive assembly output on Godbolt: https://godbolt.org/z/W19Gz8x73

Changing it to a VLA with an explicitly declared length doesn't help much: https://godbolt.org/z/bb5fzbdM1

This is a version of the code modified so that the function only returns a match roughly every 100 characters: https://godbolt.org/z/h8MjbP1cf

noɥʇʎԀʎzɐɹƆ
  • I see three branches: `i < length`, `string[i] == '['`, and `string[i] == ' '`. Are any of them optional? – Robert Harvey Apr 05 '21 at 20:20
  • @RobertHarvey No. Is there a way to implement these without cmp/jmp on an assembly level? – noɥʇʎԀʎzɐɹƆ Apr 05 '21 at 20:28
  • I don't see how. You still need to make the checks, and cmp/jmp is the way assembly does this. – Robert Harvey Apr 05 '21 at 20:31
  • @RobertHarvey Would it be possible e.g. compare 8 bytes at a time with a bitwise comparison and use a branch at the end of the 8 bytes? I've seen GCC write code like that before. – noɥʇʎԀʎzɐɹƆ Apr 05 '21 at 20:32
  • Wouldn't that result in *more* branches, not less? – Robert Harvey Apr 05 '21 at 20:33
  • compare == branch. – Robert Harvey Apr 05 '21 at 20:33
  • The compiler cannot vectorise the code because the code does not access the string beyond a matching character. This character might be the last character on the last mapped page. So the compiler cannot safely generate code that fetches multiple characters and checks them in parallel. You can e.g. change the code such that it simply sets a variable on match and proceeds through the rest of the string. This way, the compiler can make more assumptions about what memory accesses it may perform. – fuz Apr 05 '21 at 20:34 (a sketch of this rewrite follows after these comments)
  • @fuz Adding a length parameter doesn't change the code: https://godbolt.org/z/zsv3eGGxb Does using a VLA signal that the entire string is a valid array? – noɥʇʎԀʎzɐɹƆ Apr 05 '21 at 20:38
  • 3
    @fuz: Not true; a compiler targeting a specific mainstream OS knows that memory protection has page granularity, not segmentation with some arbitrary byte limit, so it can use code that works the same way as the hand-written asm for `strlen` or `strchr` in libc. [Is it safe to read past the end of a buffer within the same page on x86 and x64?](https://stackoverflow.com/q/37800739). This is actually just a missed optimization in GCC/clang. ICC does know how to auto-vectorize loops whose trip-count can't be calculated ahead of time (e.g. search loops) – Peter Cordes Apr 05 '21 at 20:39
  • @fuz Doing exactly what you suggested results in different assembly, but it doesn't seem like it is doing any special optimizations. https://godbolt.org/z/Po6rd1WWY – noɥʇʎԀʎzɐɹƆ Apr 05 '21 at 20:40
  • @noɥʇʎԀʎzɐɹƆ Interesting. Not sure how to coax the compiler into optimising it then. – fuz Apr 05 '21 at 20:46
  • 1
    @fuz and OP: `const char str[length]` as a function arg still doesn't promise the compiler it can touch memory other than what the abstract machine does. It's still exactly equivalent to `const char *str`. C99 *does* have syntax like `const char str[static 100]` which might even work with a variable length, but IIRC GCC doesn't usually take advantage anyway. ([What is the purpose of static keyword in array parameter of function like "char s\[static 10\]"?](https://stackoverflow.com/q/3430315)) – Peter Cordes Apr 05 '21 at 20:52
  • 6
    Of course, even with a static array where the size is definitely known, GCC still won't do *this* optimization; only ICC's auto-vectorizer can handle loops whose trip-count can't be calculated before the first iteration runs. Are you interested in how to optimize this for x86 specifically, with SSE or AVX intrinsics like `_mm_cmpeq_epi8` / `_mm_movemask_epi8`, or are you still looking to keep it portable? – Peter Cordes Apr 05 '21 at 20:52
  • @noɥʇʎԀʎzɐɹƆ On average, how many characters do you expect the code to search before finding a match? Are you willing to rewrite the source code to allow examining the string in `uint64_t`-size chunks, using portable C code? – njuffa Apr 05 '21 at 21:44
  • @njuffa: note that it's non-trivial to use `uint64_t` safely; you need either `memcpy` or a typedef with GNU C `__attribute__((may_alias))`, like shown in [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) where my answer shows how to fix the strict-aliasing bugs in glibc's portable-C fallback version. So you might want to link or reference that if you're planning to write a version based on https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord – Peter Cordes Apr 05 '21 at 21:59
  • @PeterCordes FWIW, my plan would be to write it using the same ideas I used to implement various string functions in Solaris twenty years ago, using naturally aligned 64-bit loads for the bulk of the processing. Those were admittedly written in SPARC assembly language (when I last checked ten years ago my handiwork was still visible in the OpenSolaris source base). In this case the source pointer is already guaranteed to be 16-byte aligned if I read that correctly, so casting the pointer via `(void *)` and `uintptr_t` should work, I would think? I will look at your first link. – njuffa Apr 05 '21 at 22:52
  • @noɥʇʎԀʎzɐɹƆ - your 2nd Godbolt link (https://godbolt.org/z/Po6rd1WWY) does actually always traverse the whole array, so in theory could auto-vectorize. But recording the match-position is inconvenient / difficult to make asm that actually does that, and it would likely be sub-optimal except possibly in the case of short fixed-length buffers (like 2 or 4 vectors worth) if the function is called very frequently with the same size, so branch prediction can "learn" how many iterations the loop runs. – Peter Cordes Apr 05 '21 at 22:52
  • @njuffa: alignment isn't the problem, the strict-aliasing rule is. Otherwise I would have included `__attribute__((aligned(1), may_alias))`. Accessing a char object through a `uint64_t*` dereference is UB. (That could happen if passed a pointer to a `char array[64]` array object for example. Dynamically allocated memory is anonymous, no known type, so accessing via `uint64_t*` here and `char*` everywhere else would be fine because char* can alias anything. But named variables have types. Still maybe unlikely to cause a problem in practice if array access works like char*, not sure) – Peter Cordes Apr 05 '21 at 22:58
  • @njuffa: An example of breakage with a type other than `char` is [gcc, strict-aliasing, and horror stories](https://stackoverflow.com/a/2959468). BTW, Intel intrinsics are defined to avoid these problems: [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605) – Peter Cordes Apr 05 '21 at 23:00
  • @PeterCordes I am afraid you lost me there. Here we have a `char *` to begin with. Casting via `void *` used to work just fine to tell the compiler to forget the type a pointer used to point to (leaving issues about making sure accesses are naturally aligned), and is standard compliant by my reading of the C standard. But I didn't go to language-lawyer school. [Later:] Read your link. I am with Linus on this one. – njuffa Apr 05 '21 at 23:14
  • @njuffa The `[` and ` ` is an optimized case. However, I could also split it into two single character searches of `-` and `[` looking at every char, with ~100 between. For finding both `[` and ` `, 100 characters looking at every other char, but every 4-5 chars, it will find a ` ` and need to execute an additional check. I am willing to rewrite. – noɥʇʎԀʎzɐɹƆ Apr 06 '21 at 00:10
  • @njuffa This is the function modified to have ~100 characters between a return: https://godbolt.org/z/h8MjbP1cf (It skips over 2 characters at a time, hence the '-' check) – noɥʇʎԀʎzɐɹƆ Apr 06 '21 at 00:14
  • @noɥʇʎԀʎzɐɹƆ Have you tried simply making two calls to system-provided `strchr()`, which presumably is highly optimized? Assuming your strings are not overly long, the second call would benefit from the first call pulling the data into caches. The above comments by Peter Cordes have convinced me that it would be a royal pain to try and write fast string functions in C instead of assembly language, at least when using gcc. – njuffa Apr 06 '21 at 00:40
  • @njuffa: The `char*` function arg has to be pointing to something, and it's UB if that's an object that's definitely not `uint64_t`, unless you use `memcpy` into `uint64_t` instead of deref of `uint64_t*`, or a typedef. (In practice this matters after function inlining; e.g. violating strict aliasing can mean it's not safe to compile with LTO). Just for example, say the original data was an array of `unsigned short` and that's why only every 2nd char matters. After inlining, the compiler can assume that `uint64_t*` derefs aren't reading the same data that `my_u16[i] = '['` wrote. – Peter Cordes Apr 06 '21 at 01:49
  • @njuffa: Passing a pointer via a `char*` or `void*` function arg doesn't "launder" it in terms of removing strict-aliasing UB; it doesn't make it safe to deref it as types other than the original. That's true even if the original data was a `char buf[100]`. It would be fine if you had `char *buf = malloc(100);` though, because then the only accesses to it would be via `char*` or in fast-strings functions, as long as all your fast-strings stuff uses the same type. (`char*` is allowed to alias anything, like `__m128i*`, or `my_aliasing_u64*` with a GNU C typedef.) – Peter Cordes Apr 06 '21 at 01:53
  • @njuffa: for portable C, I'd suggest writing `uint64_t aliasing_u64_load(void *p) { uint64_t tmp; memcpy(&tmp, p, sizeof(tmp)); return tmp; }`. (That also makes unaligned loads safe, so GCC won't always inline it as a single load instruction if it can't prove alignment, on ISAs where unaligned word loads aren't safe.) – Peter Cordes Apr 06 '21 at 01:55
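
For reference, here is roughly what fuz's suggested rewrite looks like in portable C (hypothetical name, untested): always walk the whole buffer and only remember the first match, so the loop's trip count no longer depends on the data. As the Godbolt links above show, GCC still does not auto-vectorize this form.

#include <stddef.h>

size_t findCharFullScan(unsigned int length, const char* string) {
    size_t found = (size_t)-1;
    for (size_t i = 0; i < length; i += 2) {
        /* Remember only the first match; keep scanning to the end so the
           loop's trip count does not depend on the data. */
        if (found == (size_t)-1 && (string[i] == '[' || string[i] == ' '))
            found = i;
    }
    return found;
}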
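
To make the uint64_t-chunk idea from njuffa's and Peter Cordes' comments concrete, here is a rough portable-C sketch (untested, names made up, little-endian assumed): it combines the memcpy-based aliasing-safe load from the comment above with the bithacks zero-in-word test widened to 16-bit lanes, checking four even-indexed characters per 8-byte word and falling back to the scalar loop to pin down the exact index.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Aliasing-safe 8-byte load: memcpy instead of dereferencing a uint64_t*,
   so the strict-aliasing rule is not violated; compilers inline this to a
   single load on x86-64. */
static inline uint64_t load_u64(const char* p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Nonzero if any 16-bit lane of x is zero: the bithacks "haszero" test
   widened to 16-bit lanes. Borrow propagation can set stray bits, but only
   in lanes above a genuine zero lane, so a nonzero result still means
   "at least one match". */
static inline uint64_t haszero16(uint64_t x) {
    return (x - 0x0001000100010001ull) & ~x & 0x8000800080008000ull;
}

ptrdiff_t findCharSwar(size_t length, const char* string) {
    size_t i = 0;
    /* Check 4 even-indexed characters (8 bytes) per iteration. */
    for (; i + 8 <= length; i += 8) {
        /* Keep the even-indexed bytes (the low byte of each 16-bit lane on
           a little-endian target), zero the odd ones. */
        const uint64_t w = load_u64(string + i) & 0x00FF00FF00FF00FFull;
        const uint64_t hit = haszero16(w ^ 0x005B005B005B005Bull)  /* '[' */
                           | haszero16(w ^ 0x0020002000200020ull); /* ' ' */
        if (hit)
            break;  /* some even byte in this chunk matches; rescan it below */
    }
    /* Scalar finish: pins down the exact index and handles the tail. */
    for (; i < length; i += 2) {
        if (string[i] == '[' || string[i] == ' ')
            return (ptrdiff_t)i;
    }
    return -1;
}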

1 Answer


I don't know how to convince the compiler to emit good auto-vectorized code for that, but I know how to vectorize manually. Since you're compiling for Skylake, here's an AVX2 version of your function. Untested.

#include <stddef.h>
#include <immintrin.h>

ptrdiff_t findCharAvx2( size_t length, const char* str )
{
    const __m256i andMask = _mm256_set1_epi16( 0xFF );
    const __m256i search1 = _mm256_set1_epi16( '[' );
    const __m256i search2 = _mm256_set1_epi16( ' ' );

    const char* const ptrStart = str;
    const char* const ptrEnd = str + length;
    const char* const ptrEndAligned = str + ( length / 32 ) * 32;
    for( ; str < ptrEndAligned; str += 32 )
    {
        // Load 32 bytes, zero out half of them
        __m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
        vec = _mm256_and_si256( andMask, vec );

        // Compare 16-bit lanes for equality, combine with OR
        const __m256i cmp1 = _mm256_cmpeq_epi16( vec, search1 );
        const __m256i cmp2 = _mm256_cmpeq_epi16( vec, search2 );
        const __m256i any = _mm256_or_si256( cmp1, cmp2 );
        const int mask = _mm256_movemask_epi8( any );

        // If neither character is found, mask will be 0.
        // Otherwise, the least significant set bit = index of the first matching byte in `any` vector
#ifdef _MSC_VER
        unsigned long bitIndex;
        // That's how actual instruction works, it returns 2 things at once, flag and index
        if( 0 == _BitScanForward( &bitIndex, (unsigned long)mask ) )
            continue;
#else
        if( 0 == mask )
            continue;
        const int bitIndex = __builtin_ctz( mask );
#endif
        return ( str - ptrStart ) + bitIndex;
    }

    // Handle the remainder
    for( ; str < ptrEnd; str += 2 )
    {
        const char c = *str;
        if( c == '[' || c == ' ' )
            return str - ptrStart;
    }
    return -1;
}
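
For reference, a minimal caller; the test string is my own example, and the code needs to be built with GCC -O3 -march=skylake (or at least -mavx2) since the intrinsics require AVX2:

#include <stdio.h>

int main( void )
{
    // Space at even index 2; the 37-byte string is long enough
    // to exercise one full 32-byte vector iteration.
    const char text[] = "ab cdefghijklmnopqrstuvwxyz0123456789";
    printf( "%td\n", findCharAvx2( sizeof text - 1, text ) );  // prints 2
    return 0;
}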
Soonts
  • Why `__builtin_ffs( mask ) - 1;` instead of `__builtin_ctz( mask )`? (count trailing zeros = BSF or TZCNT) – Peter Cordes Apr 06 '21 at 17:37
  • You could avoid the SIMD AND mask and instead mask the `movemask` result. Since you branch on it being non-zero anyway, hopefully the compiler can just use `and reg,0x55555555/jnz found` instead of `test reg,reg/jnz found`. Can still macro-fuse on Intel but not AMD. Loading with a memory-source VPAND instead of a separate VMOVDQU is cheap (especially if it avoids an indexed addressing mode on Intel so it's still a single micro-fused uop), but it does need another SIMD ALU uop in the back-end. – Peter Cordes Apr 06 '21 at 17:42 (see the sketch after these comments)
  • `vpshufb` could duplicate each byte to the containing word instead of AND, setting up for a single `vpcmpeqb` with `set1_epi16('[' << 8 | ' ')`, then I guess `ctz(mask) >> 1`. Alternatively, 2x VPAND / VPACKUSWB sets up for 2x compare + OR of 2 vectors at once (and then you have to sort out the data position from in-lane shuffling if you find a hit). But I think if you're going to shuffle to increase data density, VPSHUFB within one vector is best. – Peter Cordes Apr 06 '21 at 17:50
  • @PeterCordes OK, changed the gcc builtin. Bitwise SIMD instructions are very cheap, the throughput is 1/3 cycles on Intel, 1/4 cycles on AMD. And gcc does fuse the load: https://godbolt.org/z/78Gqbva8W – Soonts Apr 06 '21 at 17:56
  • If `length >= 32` (or 16 or 8), it should be possible to do cleanup with a final unaligned vector that ends at the end of the array, overlapping and re-checking some number of elements depending on len%32. You only need *scalar* cleanup if the array is too small for even a single vector. (Even then, you could implement alignment checking / masking.) – Peter Cordes Apr 06 '21 at 17:57
  • @PeterCordes Yeah, I thought about that trick, decided I don't like the overhead in complexity. One still needs a scalar loop for small input arrays. My version is already way more complicated than OP's original code. – Soonts Apr 06 '21 at 18:04
  • I know they're cheap, but you have 5 SIMD ALU operations per loop, and only 3 SIMD ALU ports on Intel. (Although it's a 9-uop loop so even IceLake's 5-wide front-end will have a hard time saturating the ALU ports). Also, you have three 32-byte constants instead of two, so that touches at least 2 cache-lines (because current GCC and clang are dumb and don't load them with `vpbroadcastd`, like 1 extra code byte each to reduce 32 bytes to 4 bytes.) Anyway, on Intel I'm pretty sure `mask &= 0x55555555;` is strictly better, except for possible code size / alignment effects. And not bad on AMD. – Peter Cordes Apr 06 '21 at 18:07
  • Yeah, avoiding scalar cleanup takes more work to code, although a helper function can reduce the repeated work. Still, for small strings like 30 bytes, it's all scalar, or for a 60 byte string it's one vector and 28 scalar iterations. Might be better to just use 128-bit vectors if you expect short strings to be common. (So yeah, tuning strongly depends on your expected use-case. If short strings are important, some kind of alignment-check to handle buffers less than 1 full vector can let you drop the scalar cleanup. e.g. load and check for `mask <= 1U< – Peter Cordes Apr 06 '21 at 18:40
  • How many characters does it need to scan at a time to be worth using? 8? 100? – noɥʇʎԀʎzɐɹƆ Apr 06 '21 at 19:37
  • Is it possible to vectorize this function? https://godbolt.org/z/h8MjbP1cf It returns ~ every 100 characters. – noɥʇʎԀʎzɐɹƆ Apr 06 '21 at 19:38
  • @noɥʇʎԀʎzɐɹƆ I think even 8 will be faster than scalar code. SIMD is implemented inside CPU cores, latency overhead for passing data between vectors and general-purpose registers is just a couple CPU cycles. – Soonts Apr 08 '21 at 14:57
  • @noɥʇʎԀʎzɐɹƆ Of course, it is possible: https://godbolt.org/z/bPe9n6Kvv – Soonts Apr 08 '21 at 15:24
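
Picking up the first of Peter Cordes' comments in this thread, here is a hedged sketch of that variant (untested, hypothetical name findCharAvx2b, GCC/clang builtins only): compare all 32 bytes with vpcmpeqb against both characters, then mask the movemask result with 0x55555555 so only even-indexed bytes count, instead of zeroing the odd bytes with a SIMD AND. The remainder handling is the same as in the answer.

#include <stddef.h>
#include <immintrin.h>

ptrdiff_t findCharAvx2b( size_t length, const char* str )
{
    const __m256i searchBracket = _mm256_set1_epi8( '[' );
    const __m256i searchSpace = _mm256_set1_epi8( ' ' );

    const char* const ptrStart = str;
    const char* const ptrEnd = str + length;
    const char* const ptrEndAligned = str + ( length / 32 ) * 32;
    for( ; str < ptrEndAligned; str += 32 )
    {
        // Compare all 32 bytes against both characters; no SIMD AND needed
        const __m256i vec = _mm256_loadu_si256( ( const __m256i * )str );
        const __m256i any = _mm256_or_si256( _mm256_cmpeq_epi8( vec, searchBracket ),
                                             _mm256_cmpeq_epi8( vec, searchSpace ) );

        // One mask bit per byte; keep only the bits of even-indexed bytes
        const unsigned mask = (unsigned)_mm256_movemask_epi8( any ) & 0x55555555u;
        if( mask )
            return ( str - ptrStart ) + __builtin_ctz( mask );
    }

    // Scalar remainder, same as in the answer
    for( ; str < ptrEnd; str += 2 )
    {
        const char c = *str;
        if( c == '[' || c == ' ' )
            return str - ptrStart;
    }
    return -1;
}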