I'm developing a bioinformatic tool. I'm interested in applying SIMD to boost its speed. Given two strings of equal length, I would like to rapidly count the total number of indices where the two strings have identical characters.
For example, say we have S1="AATTGGCCAAC" and S2="AATTCTCCAAC". Then, since their lengths are 11 and only differ at position 5 and 6 ("GG" in S1 and "CT" in S2), the output should be 9.
Here is what I have so far:
#include <string>
#include <immintrin.h>
using namespace std;
#include <memory.h>
int main()
{
register __m128i str_A, str_B, char_eq;
str_A = _mm_load_si128((__m128i*)("AATTGGCCAAC"));
str_B = _mm_load_si128((__m128i*)("AATTCTCCAAC"));
char_eq = _mm_cmpeq_epi8(str_A, str_B);
}
String comparison seems to work fine.
uint8_t val[11];
memcpy(val, &char_eq, sizeof(val));
printf("Numerical: %i %i %i %i %i %i %i %i %i %i %i \n",
val[0], val[1], val[2], val[3], val[4], val[5],
val[6], val[7],val[8], val[9], val[10]);
}
, which outputs 255 255 255 255 0 0 255 255 255 255 255
So now I have a register __m128i object called char_eq which contains information on whether each characters match or mismatch. How do turn this __m128i char_eq object into an integer that encodes the number of matching characters? The only way I can think of is to manually add the boolean values up (i.e. 1+1+1+1+0+0+1+1+1+1+1) but this defeats the purpose of using SIMD since that will require length(str) number of additions.
What is the fastest way to find the total number of matching characters in two strings? I hope to make it O(1). Thank you in advance!