SIMD: Compare two strings char by char and find the total number of matches

Question

I'm developing a bioinformatic tool. I'm interested in applying SIMD to boost its speed. Given two strings of equal length, I would like to rapidly count the total number of indices where the two strings have identical characters.

For example, say we have S1="AATTGGCCAAC" and S2="AATTCTCCAAC". Then, since their lengths are 11 and only differ at position 5 and 6 ("GG" in S1 and "CT" in S2), the output should be 9.

Here is what I have so far:

#include <string>
#include <immintrin.h>
using namespace std;
#include <memory.h>
int main()
{
 register __m128i str_A, str_B, char_eq;


str_A = _mm_load_si128((__m128i*)("AATTGGCCAAC")); 
str_B = _mm_load_si128((__m128i*)("AATTCTCCAAC")); 

char_eq = _mm_cmpeq_epi8(str_A, str_B); 

}

String comparison seems to work fine.

uint8_t val[11];
    memcpy(val, &char_eq, sizeof(val));
    printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i \n", 
           val[0], val[1], val[2], val[3], val[4], val[5], 
           val[6], val[7],val[8], val[9], val[10]);          
}

, which outputs 255 255 255 255 0 0 255 255 255 255 255

So now I have a register __m128i object called char_eq which contains information on whether each characters match or mismatch. How do turn this __m128i char_eq object into an integer that encodes the number of matching characters? The only way I can think of is to manually add the boolean values up (i.e. 1+1+1+1+0+0+1+1+1+1+1) but this defeats the purpose of using SIMD since that will require length(str) number of additions.

What is the fastest way to find the total number of matching characters in two strings? I hope to make it O(1). Thank you in advance!

Note that the data pointed to by the pointer argument of `_mm_load_si128` [must be 16-byte aligned](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm_load_si128&expand=3343). — Daniel Langr, Apr 21 '21 at 06:13
You can count the number of set bits in `char_eq` and shift the result right by 3 (divide by 8). Relevant question: [Fast counting the number of set bits in __m128i register](https://stackoverflow.com/q/17354971/580083). — Daniel Langr, Apr 21 '21 at 06:21
Is it a typical input string or do you plan to make it run on bigger string? Is the size fixed? If not, do you know the size of the string before computing them? — Jérôme Richard, Apr 21 '21 at 20:23
@DanielLangr _m128i at most takes 16 characters, but my strings will be smaller than that (e.g. 15). The code in "Fast counting ... " that you shared works well for 16-character long strings but for shorter strings, it doesn't work, probably because _mm_cmpeq_epi8 and other functions operate on bits of _m128i that are not occupied because the input strings are shorter than 128 bits. How do I handle this issue? — JWO, Apr 21 '21 at 22:35
@JérômeRichard It's fixed size, and I know the size of the string before computing them. The size will be about 15 characters — JWO, Apr 21 '21 at 22:35
@JWO Since strings need to by 16 byte aligned, it would make sense to store them as 16-character long strings padded with zero/null characters (not `'0'`). — Daniel Langr, Apr 22 '21 at 03:39

SIMD: Compare two strings char by char and find the total number of matches

0 Answers0