calculating how much two strings are similar?

Question

I have a function that calculates the variance of the two given strings. Is there a faster method (or algorithm) to do such thing?

Please keep in mind that every letter of my strings are all loaded with DNA, which means that these are one of the A or T or C or G:

unsigned __int8 dis(char* FirstString, char* SecondString)
{
    unsigned __int8 distanceIndex = 0;
    for (unsigned __int8 i = 0; i < l; i++)
    {
        if (FirstString[i] != SecondString[i])
            distanceIndex++;
    }
    return distanceIndex;
}

How about Levenshtein Distance https://en.wikipedia.org/wiki/Levenshtein_distance ? — Arunmu, Jul 19 '16 at 14:03
You may want to use the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) like spell checkers do. — NathanOliver, Jul 19 '16 at 14:03
does Levenshtein provide a better time complexity?, can anyone implement such method, like i do in the above? — Shaheen Zahedi, Jul 19 '16 at 14:07
if you only have 4 possible values per element, you can use a more compact representation of your sequence. You can use only 2 bits instead of 8. You may then calculate the hamming distance on that. https://en.wikipedia.org/wiki/Hamming_distance — Alessandro Teruzzi, Jul 19 '16 at 14:08
I was just trying to figure out where `l` came from, and whether it is intentional that both sequences have identical length, as they value `l` and that loop construct seems to have us assume. — WhozCraig, Jul 19 '16 at 14:13
@WhozCraigthe you could assume that the l is a constant value, which in this case is 13 by the way — Shaheen Zahedi, Jul 19 '16 at 14:16
Depending on the compiler and/or target platform, I can _conceive_ that using an explicitly 8-bit int in the loop might possibly be slower than using a natural-size int. — TripeHound, Jul 19 '16 at 15:33
*`__int8 distanceIndex`*? -- I really hope that your strings are always shorter than 128 characters. — Wolf, Jul 20 '16 at 07:40
@ShaheenZahedi If you mean "how can I tell that it will make a difference", then I can't -- just that there's an outside possibility that manipulating 8-bit counters on a 32/64-bit processor could, conceivably, be fractionally slower than using a "natural" 32-bit counter. If there _was_ a difference, I'd expect it to be very marginal, but if you've identified this routine as a bottle-neck, and are calling it enough times that fractional improvements might make a difference, then it _might_ be worth profiling alternatives. — TripeHound, Jul 20 '16 at 07:55
@TripeHound right,so you cannot tell for sure, huh? i guess the only way to find out is to benchmark it — Shaheen Zahedi, Jul 20 '16 at 08:41
@Wolf No, i don't, but this is bottleneck function for my current project that i'm working on, and i need it to be as fast as possible — Shaheen Zahedi, Jul 20 '16 at 09:11
@ShaheenZahedi Correct -- at the level you're talking (e.g. the algorithm isn't demonstrably inefficient), benchmarking one idea vs. another is the only way to tell for sure (and that's benchmarking in as realistic environment as you can. Running that function over 13-character strings will "normally" be "so quick as to be ignorable" and must only be significant in your app (where, I guess, it's being called thousands or millions of times). — TripeHound, Jul 20 '16 at 11:19
@ShaheenZahedi I found [a fast compare applicable for DNA sequences of fixed length up to 16](http://stackoverflow.com/a/38491405/2932052) :) BTW: you should better add the fixed length 13 to the question, this is an important point. — Wolf, Jul 20 '16 at 21:54

Wolf · Accepted Answer · 2016-07-23T12:25:57.087

3

Although I still doubt that the string comparison is really the bottleneck of your project, I couldn't resist to take up the challenge...

All of your sequences are 13 chars long. DNA sequences contain only the letters ATCG, which can be encoded within 2 bits. You can store each DNA sequence within a 32 bit value, letting the computer do the comparison in parallel:

XOR-combine the values to get bit differences
shift and OR-combine AND-normalized subsets (odd bits, even bits) to transform bit differences into nucleobase differences
count the set bits to obtain the DNA sequence distance

Depending on the computer architecture there may be a bit count function implemented in the CPU. More details have the answers to the question: How to count the number of set bits in a 32-bit integer?

Here is the core function:

int distV(const unsigned va, const unsigned vb)
{
    const unsigned x = va ^ vb;
    const unsigned bn = ((x & 0xaaaaaaaa) >> 1 ) | (x & 0x55555555);
    return __builtin_popcount(bn);
}

See the full GCC-4.3.2 demo that uses sequences of length 16. I measured a performance increment of factor 4 for the comparison itself (excluding the encoding).

edited Jul 23 '16 at 12:25

answered Jul 20 '16 at 21:46

Wolf

9,679
7
62
108

It's a really good idea to compress the nucleobase string thanx for mentioning, but can you explain how to do the comparision in parallel? – Shaheen Zahedi Jul 23 '16 at 07:22
It's bit operations. Did you have a look at the demo? I'll add the core function for comparison to my answer. – Wolf Jul 23 '16 at 12:24
That would be great :) – Shaheen Zahedi Jul 24 '16 at 06:44
@ShaheenZahedi it's **already added**, see the `distV` function: it compares 32 bits (that is 16 nucleobases in the given encoding) in parallel. – Wolf Jul 24 '16 at 07:09
oh sorry, I see, that's truly fastest,but it's take amount of time for encode though, is it worth it, that i re-implement all the project to work with encoded strings? – Shaheen Zahedi Jul 24 '16 at 07:45
@ShaheenZahedi It depends on your actual needs. The encoding has its cost of course. But (I did not measure it) I guess, it's nearly as cheep as a normal string compare. There can be **another benefit** for your purpose: It **saves memory** (4 bytes instead of at least 13 bytes) and therefore most probably also faster (Less disk swapping. Aligned memory access on a 32 bit machines). By the was, concerning your *`truly fastes`*, I bet there are even faster alternatives out there... ;) – Wolf Jul 24 '16 at 10:21
@ShaheenZahedi I measured the time consumption for encoding: on my machine it takes 168% of the time used for string comparison (I built a 32bit program). – Wolf Jul 24 '16 at 13:17

score 1 · Answer 2 · answered Jul 19 '16 at 14:02

1

This is an O(n) algorithm.

The most efficient algorithm to compare equality (or distance in this case) between two strings is O(n).

answered Jul 19 '16 at 14:02

Tyler

740
9
27

what about Levenshtein and hamming distance?, is this algorithm better than those? – Shaheen Zahedi Jul 19 '16 at 14:19
Levenshtein complexity is O(n + d^2) where d is the distance. Hamming distance is O(n) as well. – Tyler Jul 19 '16 at 14:27
3

The comment was added after the answer was already posted, but it doesn't really make much sense talking about O(N) when N is just 13, so algorithm complexity, even if right in general, doesn't answer to this question IMHO. – Alessandro Teruzzi Jul 19 '16 at 15:54
I have no idea why OP is even worrying about complexity when n=13. But the above answer is *general*. – Tyler Jul 19 '16 at 16:00
@AlessandroTeruzzi there is really some place for optimisation in this given case (low n), depending on the operand's width of the architecture, see [my answer](http://stackoverflow.com/a/38491405/2932052). – Wolf Jul 24 '16 at 07:16

score 0 · Answer 3 · edited Jul 20 '16 at 07:36

0

You could make it slightly faster by avoiding the random accessing done by indexing, you only actually need sequential accessing of the string.

I'm not sure whether a compiler could optimise that for you though.

edited Jul 20 '16 at 07:36

Wolf

9,679
7
62
108

answered Jul 19 '16 at 14:06

ROX

1,256
8
18

my compiler, can optimize that, there is not difference between these two – Shaheen Zahedi Jul 19 '16 at 14:09

Ohad Eytan · Answer 4 · 2016-07-19T15:05:58.470

0

You can spare the if:

unsigned __int8 dis(char* FirstString, char* SecondString)
{
    unsigned __int8 distanceIndex = 0;
    for (unsigned __int8 i = 0; i < l; i++)
       {
            distanceIndex += FirstString[i] != SecondString[i];
       }
    return distanceIndex;
}

but I doubt if this significant

edited Jul 19 '16 at 15:05

answered Jul 19 '16 at 14:10

Ohad Eytan

8,114
1
22
31

it is not correct, the distanceIndex just get incremented every time – Shaheen Zahedi Jul 19 '16 at 14:56
@ShaheenZahedi I just tried it and it work fine for me (I used regular `unsigned int` if that's matter) – Ohad Eytan Jul 19 '16 at 15:14
it just give me wrong answer,just some weird number, and i don't know what's the problem with your approach:( – Shaheen Zahedi Jul 20 '16 at 08:46
@ShaheenZahedi you can see my live code [here](http://ideone.com/FMhxuW) – Ohad Eytan Jul 20 '16 at 08:51

calculating how much two strings are similar?

4 Answers4