
I'm developing a bioinformatics tool and I'm trying to use SIMD to boost its speed.

Given two char arrays of length 16, I need to rapidly count the number of indices at which the strings match. For example, the following two strings, "TTTTTTTTTTTTTTTT" and "AAAAGGGGTTTTCCCC", match at the 9th through 12th positions ("TTTT"), so the output should be 4.

As shown in the function foo below (which works fine but is slow), I packed the characters of seq1 and seq2 into __m128i variables s1 and s2 and used _mm_cmpeq_epi8 to compare every position simultaneously. Then I used popcnt128 (from Fast counting the number of set bits in __m128i register, by Marat Dukhan) to add up the number of matching bits.

float foo(char* seq1, char* seq2) {
    __m128i s1, s2, ceq;
    int match;
    s1 =  _mm_load_si128((__m128i*)(seq1));   // load 16 bytes of each sequence
    s2 =  _mm_load_si128((__m128i*)(seq2));
    ceq = _mm_cmpeq_epi8(s1, s2);             // 0xFF in each matching byte, 0x00 otherwise
    match = (popcnt128(ceq)/8);               // 8 set bits per matching byte
    return match;
}

Although popcnt128 by Marat Dukhan is a lot faster than naïvely adding up every bit in the __m128i, popcnt128 is still the slowest part of the function, accounting for about 80% of its run time. So I would like to find an alternative to popcnt128.


I tried to interpret the __m128i ceq as a string and use it as the key into a precomputed look-up table that maps each 16-byte pattern to its number of set bits. If a char array were hashable, I could do something like

union { __m128i ceq; char c_arr[16]; };
match = table[c_arr];  // table = unordered_map

If I try something similar for strings (i.e. union{__m128i ceq; string s;};), I get a compiler error saying the constructor "is implicitly deleted because the default definition would be ill-formed". When I tried other things, I ran into segmentation faults.

Is there any way I can tell the compiler to read a __m128i as a string, so I can use the __m128i directly as a key for an unordered_map? I don't see why it shouldn't work, since a string is a contiguous array of chars, which is exactly what a __m128i can hold. But I couldn't get it to work and was unable to find a solution online.
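To make the intent concrete, here is roughly the kind of lookup I am after (a sketch only; lookup is a made-up helper, and I have to memcpy the bytes out because I don't know a legal way to reinterpret them in place):

    #include <cstring>
    #include <string>
    #include <unordered_map>
    #include <emmintrin.h>

    std::unordered_map<std::string, int> table;   // bit pattern -> number of set bits (precomputed)

    int lookup(__m128i ceq) {
        char bytes[16];
        std::memcpy(bytes, &ceq, sizeof(bytes));  // legal byte copy, unlike the union trick
        return table[std::string(bytes, 16)];     // 16-byte key (may contain '\0')
    }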

  • I think using a table is not a good idea, as I guess there are 2^128 possibilities, isn't it? (a lot more than your RAM can hold). Moreover, hashing will probably be slower than the current `popcnt128` function. Besides this, accessing `c_arr` the way you do produces an ill-formed program leading to undefined behaviour in C++ (use `memcpy` instead if you need that). – Jérôme Richard Apr 26 '21 at 22:53
  • `popcnt128` does much more than you actually need. Just do `__builtin_popcount(_mm_movemask_epi8(ceq))`. – chtz Apr 26 '21 at 22:56
  • @chtz: or perhaps `psadbw` against 0 (hsum of bytes on the compare result), but that needs a final hsum step of qword halves, so that's going to be worse if HW popcnt is available, but worth considering if you need baseline x86-64 with just SSE2. (Also you need to negate before psadbw, or scale the sum down by 1/255.) – Peter Cordes Apr 27 '21 at 00:01
  • @JérômeRichard: The OP is proposing a `std::unordered_map` or something, not a flat array, so it could be possible; there are "only" 2^16 entries needed. But of course it is horribly slow compared to movemask / popcnt. The amount of work required just to compute a hash function of a 16-byte char array is comparable to just getting the answer. :P – Peter Cordes Apr 27 '21 at 00:04
  • Wait, do you need the count as a `float` for your return value? If so, then perhaps `psadbw` would be a good option, to avoid a round trip to scalar int and back. (`cvtdq2ps` is more efficient than `cvtsi2ss`, and could maybe make up for the extra work you'd have to do vs. `_mm_movemask_epi8` + popcnt.) – Peter Cordes Apr 27 '21 at 00:07
  • Thanks all for the very helpful comments! – JWO Apr 27 '21 at 01:48
  • @chtz I tried `__builtin_popcount(_mm_movemask_epi8(ceq))` and it's so much faster now. – JWO Apr 27 '21 at 01:48
  • @Peter Cordes I will try your suggestions as well and see if I can further improve performance. – JWO Apr 27 '21 at 01:48

1 Answer


You're probably doing this for longer sequences, multiple SIMD vectors of data. In that case, you can accumulate counts in a vector that you only sum up at the end. It's a lot less efficient to popcount every vector separately.

See How to count character occurrences using SIMD: instead of _mm256_set1_epi8(c) to search for a specific character, load from the other string. Do everything else the same, including
counts = _mm_sub_epi8(counts, _mm_cmpeq_epi8(s1, s2));
in the inner loop, and the loop unrolling. (A compare result is an integer 0 / -1, so subtracting it adds 0 or 1 to another vector.) This is at risk of overflow after 256 iterations, so do at most 255. That linked question uses AVX2, but the __m128i versions of those intrinsics only require SSE2. (Of course, AVX2 would let you get twice as much work done per vector instruction.)

Horizontal-sum the byte counters in the outer loop using _mm_sad_epu8(v, _mm_setzero_si128()); and then accumulate into another vector of counts. Again, this is all in the code in the linked Q&A, so just copy/paste that and add a load from the other string into the inner loop, instead of using a broadcast constant.

Can counting byte matches between two strings be optimized using SIMD? shows basically the same thing for 128-bit vectors, including a version at the bottom that only does SAD hsums after an inner loop. It's written for two input pointers already, rather than char and string.
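
As a concrete illustration of that loop structure (a sketch under the stated assumptions, not code copied from the linked Q&As; the function name is made up and the length is assumed to be a multiple of 16):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <stddef.h>
    #include <stdint.h>

    uint64_t count_matches_sse2(const char* seq1, const char* seq2, size_t len) {
        __m128i total = _mm_setzero_si128();            // pair of 64-bit accumulators
        size_t i = 0;
        while (i < len) {
            __m128i byte_counts = _mm_setzero_si128();  // 16 per-byte counters
            // Inner loop: at most 255 vectors per chunk so the 8-bit counters can't overflow.
            size_t chunk_end = i + 255 * 16;
            if (chunk_end > len) chunk_end = len;
            for (; i < chunk_end; i += 16) {
                __m128i s1 = _mm_loadu_si128((const __m128i*)(seq1 + i));
                __m128i s2 = _mm_loadu_si128((const __m128i*)(seq2 + i));
                // cmpeq produces 0 / -1 per byte; subtracting adds 0 or 1 to each counter.
                byte_counts = _mm_sub_epi8(byte_counts, _mm_cmpeq_epi8(s1, s2));
            }
            // Horizontal-sum the byte counters into two 64-bit lanes and accumulate.
            total = _mm_add_epi64(total, _mm_sad_epu8(byte_counts, _mm_setzero_si128()));
        }
        // Add the two 64-bit halves (x86-64 only: _mm_cvtsi128_si64).
        return (uint64_t)_mm_cvtsi128_si64(total)
             + (uint64_t)_mm_cvtsi128_si64(_mm_unpackhi_epi64(total, total));
    }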


For a single vector:

You don't need to count all the bits in your __m128i; take advantage of the fact that all 8 bits in each byte are the same by extracting 1 bit per element to a scalar integer. (x86 SIMD can do that efficiently, unlike some other SIMD ISAs.)

    count = __builtin_popcount(_mm_movemask_epi8(cmp_result));
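
In a complete function, that looks something like the sketch below (count_matches16 is a made-up name; __builtin_popcount is the GCC/clang builtin, so compile with -mpopcnt or -msse4.2 to get a single popcnt instruction):

    #include <emmintrin.h>   // SSE2

    int count_matches16(const char* seq1, const char* seq2) {
        __m128i s1  = _mm_loadu_si128((const __m128i*)seq1);  // _mm_load_si128 if 16-byte aligned
        __m128i s2  = _mm_loadu_si128((const __m128i*)seq2);
        __m128i ceq = _mm_cmpeq_epi8(s1, s2);                 // 0xFF in each matching byte
        return __builtin_popcount(_mm_movemask_epi8(ceq));    // 1 bit per byte -> match count
    }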

Another possible option is psadbw against 0 (a horizontal sum of the bytes of the compare result), but that needs a final hsum step of the qword halves, so it's going to be worse than HW popcnt. It is worth considering, though, if you can't compile with -mpopcnt and need baseline x86-64 with just SSE2. (Also, you need to negate before psadbw, or scale the sum down by 1/255.)

(Note that the psadbw strategy is basically what I described in the first section of the answer, but for only a single vector, not taking advantage of the ability to cheaply add multiple counts into one vector accumulator.)

If you really need the result as a float, that makes a psadbw strategy less bad: you can keep the value in SIMD vectors the whole time, using _mm_cvtepi32_ps to do a packed conversion on the horizontal-sum result (even cheaper than cvtsi2ss int->float scalar conversion). _mm_cvtss_f32 is free; a scalar float is just the low element of an XMM register.
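
Putting the psadbw and conversion pieces together for a single vector, a sketch might look like this (the function name is mine, not from the linked answers):

    #include <emmintrin.h>   // SSE2; pulls in xmmintrin.h for _mm_cvtss_f32

    float count_matches16_ps(const char* seq1, const char* seq2) {
        __m128i s1   = _mm_loadu_si128((const __m128i*)seq1);
        __m128i s2   = _mm_loadu_si128((const __m128i*)seq2);
        __m128i ceq  = _mm_cmpeq_epi8(s1, s2);                   // 0 / -1 per byte
        __m128i ones = _mm_sub_epi8(_mm_setzero_si128(), ceq);   // negate: 0 / 1 per byte
        __m128i sums = _mm_sad_epu8(ones, _mm_setzero_si128());  // two partial sums, one per qword
        __m128i tot  = _mm_add_epi32(sums, _mm_shuffle_epi32(sums, _MM_SHUFFLE(1, 0, 3, 2)));
        return _mm_cvtss_f32(_mm_cvtepi32_ps(tot));              // low element, converted to float
    }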

But seriously, do you really need an integer count as a float now? Can't you at least wait until you have the sum across all vectors, or keep it integer?

-mpopcnt is implied by gcc -msse4.2, or by -march=native on anything less than about 10 years old. Core 2 lacked hardware popcnt, but Intel has had it since Nehalem.
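
For example, with gcc or clang (hypothetical file name), either of these would do:

    g++ -O3 -msse4.2 count_matches.cpp
    g++ -O3 -march=native count_matches.cpp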

Peter Cordes
  • Thanks! This is very helpful. I don't actually need the count as a float; I was being a little sloppy, and keeping it as an integer did improve the performance a bit. – JWO Apr 27 '21 at 02:18
  • @JWO: Ok good. Using an inner loop of `sub(count, cmp(load1,load2))` should help significantly more, e.g. 3 instructions (`movdqa` / `pcmpeqb xmm,mem` / `psubb`), down from 5 (`movdqa` / `pcmpeqb xmm,mem` / `pmovmskb` / `popcnt` / `add`), plus loop overhead. Especially if your strings are hot in L1d or even L2 cache; if not, either way probably pretty much bottlenecks on memory bandwidth, especially if you can use AVX2. – Peter Cordes Apr 27 '21 at 02:28
  • The new 3rd-generation Intel Xeon Scalable processors (Ice Lake Xeon) support the VPOPCNT family of instructions. These return the count of bits set per element for all combinations of BYTE/WORD/DWORD/QWORD elements in vectors of width 128/256/512 bits. I have not seen latency/throughput data yet, but these will certainly be faster than a software approach. – John D McCalpin Apr 27 '21 at 16:03
  • @JohnDMcCalpin: But this problem doesn't need or even really *want* to count all the bits in a SIMD vector; instead they want to count matching bytes. For that, `pcmpeqb` / `psubb` is optimal for multiple vectors; I don't see a way to do better with SIMD popcnt. Nor for scalar for a single vector; `pmovmskb` / `popcnt` is obviously better than `vpopcntq` / shuffle / `vpaddd` / `vmovd` / `shr eax,3` because you counted 8 bits in every byte. – Peter Cordes Apr 27 '21 at 22:20
  • @JohnDMcCalpin: IceLake-client has that extension, too, and https://uops.info/ has performance data: single uop for port 5, 3-cycle latency, can micro-fuse a memory source operand. (Or 5-cycle latency with byte or word masking, otherwise the same perf for all element sizes.) – Peter Cordes Apr 28 '21 at 01:16