Although I still doubt that the string comparison is really the bottleneck of your project, I couldn't resist to take up the challenge...
All of your sequences are 13
chars long. DNA sequences contain only the letters ATCG
, which can be encoded within 2 bits. You can store each DNA sequence within a 32 bit value, letting the computer do the comparison in parallel:
- XOR-combine the values to get bit differences
- shift and OR-combine AND-normalized subsets (odd bits, even bits) to
transform bit differences into nucleobase differences
- count the set bits to obtain the DNA sequence distance
Depending on the computer architecture there may be a bit count function
implemented in the CPU. More details have the answers to the question: How
to count the number of set bits in a 32-bit
integer?
Here is the core function:
int distV(const unsigned va, const unsigned vb)
{
const unsigned x = va ^ vb;
const unsigned bn = ((x & 0xaaaaaaaa) >> 1 ) | (x & 0x55555555);
return __builtin_popcount(bn);
}
See the full GCC-4.3.2 demo that uses sequences of length 16. I measured a performance increment of factor 4 for the comparison itself (excluding the encoding).