
I found this SIMD algorithm for sub-string search in a 2016 paper:

bool like(const uint8_t* string, __m128i pat, [...]) {
    size_t i = 0;
    while (i + 16 < str_len) {
        __m128i str = _mm_loadu_si128((const __m128i*)&string[i]);
        size_t j = _mm_cmpistri(pat, str, 12);  // mode 12
        if (j >= 16) i += 16;                   // no match candidate in this block
        else {
            if (j + pat_len <= 16) return true; // full match inside this block
            i += j;                             // candidate at block end: re-check it
        }
    }
    // Process the remainder with the explicit-length variant
    if (i + pat_len <= str_len) {
        __m128i str = _mm_loadu_si128((const __m128i*)&string[i]);
        size_t j = _mm_cmpestri(pat, pat_len,
                                str, str_len - i, 12);
        if (j < 16 && j + pat_len <= 16) return true;
    }
    return false;
}

What is mode 12 of _mm_cmpistri?

Is this quite slow?

Thanks.

Duh Huh
  • Intel's ISA ref manual has a whole section about modes, separate from the entry on https://www.felixcloutier.com/x86/pcmpistri. See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a more readable guide to the various modes. Normally you'd write the mode in hex or binary, not decimal, because it has multiple bit-fields. `12 = 0b1100`. – Peter Cordes Dec 26 '18 at 00:43
  • Thanks. I didn't get the 0b1100 bits; I only started self-studying this yesterday. The instruction uses port 0 for 3 cycles on Haswell. I ran the benchmark at https://github.com/WojciechMula/sse4-strstr on a Haswell server, but the generic AVX2 version is only marginally faster than strstr for the test data. – Duh Huh Dec 26 '18 at 00:53
  • So it means "equal ordered", treating the data as unsigned bytes. https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpistri – Duh Huh Dec 26 '18 at 01:14
  • I guess mixing these intrinsics with other AVX intrinsics won't incur a penalty for transitioning between SSE and AVX: both `_mm_loadu_si128` and `_mm_cmpistri` can compile to instructions with a VEX prefix. – Duh Huh Dec 27 '18 at 18:59
  • That's correct. If you compile with AVX enabled, you'll get `vpcmpistri`. (You need to enable it in your compiler options, though, even for compilers like MSVC that normally allow using intrinsics without enabling the corresponding instruction sets. e.g. `-arch:AVX` for MSVC, or the usual GCC `-march=haswell` or `-mavx` or whatever.) – Peter Cordes Dec 28 '18 at 12:47

1 Answer


pcmpistri has one per 2 clock throughput on Ryzen, one per 3 clocks on Skylake. It's one of the faster SSE4.2 string instructions, faster than the explicit-length ones (https://agner.org/optimize/). It's pretty good for substring searches, but not for simpler strchr / memchr searches: see How much faster are SSE4.2 string instructions than SSE2 for memcmp? and SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?

Note that your title mentions _mm_cmpestri, the slow version for explicit-length strings. But your code uses _mm_cmpistri, the fast version for implicit-length strings.

(The rest of the code in that search loop should compile pretty efficiently. If the compiler uses a branch instead of cmov for the i+=16 vs. i+=j condition, branch prediction + speculative execution will hide the dependency so multiple iterations can be in flight at once, at the cost of a branch miss for most cases of finding a partial match at the end of an input vector. At least I think that's what the condition is for. Using cmov would create a data dependency between input vectors, and the instruction's latency is ~2 or 3x its throughput.)
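To make that concrete, here is a hypothetical sketch (the function names are invented, and which shape you actually get depends on the compiler) of the two ways the advance can compile:

    #include <stddef.h>

    // Branchy advance: with a correctly predicted branch, the next iteration's
    // load address doesn't wait for this iteration's pcmpistri result, so
    // several iterations can be in flight at once.
    static size_t advance_branchy(size_t i, size_t j) {
        if (j >= 16) return i + 16;  // common case: no match candidate in this vector
        return i + j;                // candidate at the vector end: rare, may mispredict
    }

    // Branchless advance: cmov makes i (and thus the next load address) data-
    // dependent on the compare result, so iterations serialize on pcmpistri's
    // latency (~2-3x its throughput, as noted above) instead of its throughput.
    static size_t advance_cmov(size_t i, size_t j) {
        return i + (j >= 16 ? 16 : j);  // likely compiles to cmov
    }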

I don't know how well it compares with a well-tuned strstr using AVX2 that avoids the SSE4.2 string instructions. I'd guess it might depend on the length of the substring you're searching for, or maybe other properties of the data like how many false-positive candidates for the start or end of the string you find.

The microbenchmarks you already found on https://github.com/WojciechMula/sse4-strstr should be good. Wojciech writes good code, and understands enough about tuning for various x86 uarches to really optimize well. I haven't looked at his string benchmarks, but I have looked at his popcnt code that explores using Harley-Seal with AVX512F vpternlogd for a big speedup.
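For reference, here is a minimal sketch of the generic first/last-byte AVX2 filtering idea his library is built around (my own simplification, not Wojciech's actual code: `avx2_strstr_sketch` is an invented name, the scalar tail is left out, and `__builtin_ctz` assumes GCC or Clang):

    #include <immintrin.h>
    #include <string.h>
    #include <stddef.h>

    static const char* avx2_strstr_sketch(const char* hay, size_t n,
                                          const char* needle, size_t k) {
        if (k == 0) return hay;
        if (k > n)  return NULL;
        if (k == 1) return memchr(hay, needle[0], n);
        const __m256i first = _mm256_set1_epi8(needle[0]);
        const __m256i last  = _mm256_set1_epi8(needle[k - 1]);
        size_t i = 0;
        // Loop while the 32-byte load at hay + i + k - 1 stays in bounds.
        while (i + k - 1 + 32 <= n) {
            __m256i blk_f = _mm256_loadu_si256((const __m256i*)(hay + i));
            __m256i blk_l = _mm256_loadu_si256((const __m256i*)(hay + i + k - 1));
            // A position is a candidate if the needle's first byte matches here
            // and its last byte matches k-1 bytes later.
            __m256i eq = _mm256_and_si256(_mm256_cmpeq_epi8(first, blk_f),
                                          _mm256_cmpeq_epi8(last,  blk_l));
            unsigned mask = (unsigned)_mm256_movemask_epi8(eq);
            while (mask) {
                unsigned b = (unsigned)__builtin_ctz(mask);   // lowest candidate
                if (memcmp(hay + i + b + 1, needle + 1, k - 2) == 0)
                    return hay + i + b;                       // verified full match
                mask &= mask - 1;                             // clear that candidate
            }
            i += 32;
        }
        // A real implementation would handle the last < 32 + k - 1 bytes here.
        return NULL;
    }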


Intel's ISA ref manual (vol.2) has a whole section about modes for string instructions (Section 4.1, “Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM”), separate from the entry on https://www.felixcloutier.com/x86/pcmpistri.

Normally you'd write the mode in hex or binary, not decimal, because it has multiple bit-fields. 12 = 0b00001100.
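Intel's headers (nmmintrin.h) define `_SIDD_*` names for those bit-fields, so you can spell mode 12 out instead of using a magic number; `LIKE_MODE` here is just an invented name:

    #include <nmmintrin.h>

    // 12 == 0x0C, decomposed into its bit-fields:
    #define LIKE_MODE (_SIDD_UBYTE_OPS          /* bits 1:0 = 00: unsigned byte elements */ \
                     | _SIDD_CMP_EQUAL_ORDERED  /* bits 3:2 = 11: substring search       */ \
                     | _SIDD_POSITIVE_POLARITY  /* bits 5:4 = 00: don't negate IntRes1   */ \
                     | _SIDD_LEAST_SIGNIFICANT) /* bit  6   = 0:  index of first match   */

    // usage: _mm_cmpistri(pat, str, LIKE_MODE)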

Intel's intrinsics guide also has pseudocode for the full details on the operation, but it's pretty heavy going if you don't know the high-level purpose. Once you do, it can be helpful. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2403,6062,4147,948&techs=SSE4_2,AVX,AVX2&text=pcmpi


See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a more readable guide to the various modes. Quoting part of it here:

Aggregation operations

The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).

  • ...

  • Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:

     operand2 = "WhenWeWillBeWed!", operand1 = "We"
     IntRes1  =  0000100000001000
    

After computing the aggregation function, IntRes1 can be complemented, expanded into a byte mask (_mm_cmpistrm) or shrunk into an index (_mm_cmpistri). The result is written into the xmm0 or ECX register. The Intel manual explains these details well, so there is no need to repeat them here.
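To make the quoted example concrete, here's a minimal self-contained demo (mine, not from that page; compile with SSE4.2 enabled, e.g. gcc -msse4.2):

    #include <nmmintrin.h>
    #include <stdio.h>

    int main(void) {
        // No 0 byte in hay16, so all 16 bytes count for the implicit-length
        // instructions; pat16 is zero-padded, giving an implicit length of 2.
        char hay16[16] = {'W','h','e','n','W','e','W','i','l','l','B','e','W','e','d','!'};
        char pat16[16] = "We";
        __m128i str = _mm_loadu_si128((const __m128i*)hay16);
        __m128i pat = _mm_loadu_si128((const __m128i*)pat16);

        __m128i mask = _mm_cmpistrm(pat, str, 12);  // IntRes1 expanded as a bit mask
        int     idx  = _mm_cmpistri(pat, str, 12);  // index of the lowest set bit

        printf("mask = %#x\n", _mm_cvtsi128_si32(mask));  // 0x1010: bits 4 and 12 set
        printf("idx  = %d\n", idx);                       // 4
        return 0;
    }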

The low 2 bits of the immediate (00) select the character format: in this case 00 = unsigned BYTE.

(signed vs. unsigned is probably irrelevant for a mode that compares for equality, not range-based.)

Bits 5:4 are the "polarity", for dealing with the end of the string, I think.

Bit 6 is the bit-scan direction for the "index" versions of the instruction that return an index instead of mask. (like bsr vs. bsf). 0 in this case finds the start of the first match instead of the end of the last match.
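Reusing pat and str from the demo above: flipping just bit 6 (0x4C = 0x0C | _SIDD_MOST_SIGNIFICANT) switches to the bsr-style scan:

    int first = _mm_cmpistri(pat, str, 0x0C);  // 4: lowest set bit of IntRes1
    int last  = _mm_cmpistri(pat, str, 0x4C);  // 12: highest set bit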

Bit 7 (the high bit of the 8-bit immediate) is unused / reserved.

See also https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpistri for tables / diagrams of the steps that lead to a result, and how different fields in the immediate modify / select the operations performed at various steps.

Peter Cordes
  • Quote accidentally includes your comment at the end, I think. – BeeOnRope Dec 26 '18 at 04:42
  • A source says that mode `0b01001100` would return the most significant index among the starting positions of all matches/partial matches. mode `0b00001100` returns the least significant index. Don't know if that is correct. https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpistri – Duh Huh Dec 26 '18 at 08:11
  • @BeeOnRope: The *After computing the aggregation function* paragraph is still quoted from that page, it comes after the bullet list (where I omitted the other 3 modes). – Peter Cordes Dec 26 '18 at 08:22
  • @DuhHuh: Oh right, yeah there are more fields in the immediate I didn't mention. Updated with that link, it's a good one. – Peter Cordes Dec 26 '18 at 08:30
  • "branch prediction + speculative execution will hide the dependency". In each iteration, does the cpu run two instructions of `_mm_loadu_si128(&string[i]);` for the `i += 16` and `i += j` cases at ports 2 or 3? If the cpu does this, then the address for loading would not depend on the result of `_mm_cmpistri(pat, str, 12)` in last iteration. How many cycles would the cpu waste for branch misprediction? – Duh Huh Dec 26 '18 at 09:03
  • @DuhHuh: `_mm_loadu_si128` isn't an asm instruction. That intrinsic in this case can compile to a `movdqu` instruction, or to a memory source operand for `vpcmpistri` (if compiling with AVX enabled, so an unaligned source is allowed). In other cases it might even be optimized away... Anyway no, there's no reason for two loads per iteration. I'm talking about whether the CPU has to wait for the result of a previous compare as part of the dependency chain for the *address calculation*. Anyway, branch miss penalty depends on the uarch and surrounding code. – Peter Cordes Dec 26 '18 at 09:04
  • See Agner Fog's microarch guide and optimization guide to understand more about how uops flow through a CPU, and other links in https://stackoverflow.com/tags/x86/info, especially David Kanter's http://www.realworldtech.com/haswell-cpu/ Haswell uarch write-up, and his earlier Sandybridge one. – Peter Cordes Dec 26 '18 at 09:06
  • Oops nevermind I thought it was a quote from the Intel manual which wouldn't then be mentioning the "Intel manual" but I see that it was from strchr. – BeeOnRope Dec 26 '18 at 12:14