2

I actually have to implement a string comparaison where I get a matching percentage at the end (not just boolean result match/unmatch). So, to do that I've found the Levenstein Distance algorithm. But the issue is now on performance. For instance, I have 1k strings to compare to each other, it takes about 10 minutes now. For each I already call the algo in parallel and again within each it's done in parallel. So i got in pseudo language :

Foreach strings
    Call in parallel the comparaison method.

Within the comparaison method

Foreach stringsToCompare
    Call in parallel the Levenstein Distance algo.

And still 10minutes at 100% CPU usage on a i5 @ 2.6Ghz...

Here's my implementation

public static double GetSimilarity(string firstString, string secondString)
{
    if (ReferenceEquals(firstString, null)) throw new ArgumentNullException("firstString");
    if (ReferenceEquals(secondString, null)) throw new ArgumentNullException("secondString");
    if (firstString == secondString) return 100;

    return (1 - GetLevensteinDistance(firstString, secondString) / (double)Math.Max(firstString.Length, secondString.Length)) * 100;
}

private static int GetLevensteinDistance(string firstString, string secondString)
{
    if (ReferenceEquals(firstString, null)) throw new ArgumentNullException("firstString");
    if (ReferenceEquals(secondString, null)) throw new ArgumentNullException("secondString");
    if (firstString == secondString) return 1;

    int[,] matrix = new int[firstString.Length + 1, secondString.Length + 1];

    for (int i = 0; i <= firstString.Length; i++)
        matrix[i, 0] = i; // deletion
    for (int j = 0; j <= secondString.Length; j++)
        matrix[0, j] = j; // insertion

    for (int i = 0; i < firstString.Length; i++)
        for (int j = 0; j < secondString.Length; j++)
            if (firstString[i] == secondString[j])
                matrix[i + 1, j + 1] = matrix[i, j];
            else
            {
                matrix[i + 1, j + 1] = Math.Min(matrix[i, j + 1] + 1, matrix[i + 1, j] + 1); //deletion or insertion
                matrix[i + 1, j + 1] = Math.Min(matrix[i + 1, j + 1], matrix[i, j] + 1); //substitution
            }
    return matrix[firstString.Length, secondString.Length];
}

So do you know a similar algo which is perhaps more appropriate for long text comparison or highly parallelizable ?

Emmanuel Istace
  • 1,209
  • 2
  • 14
  • 32
  • Have you tried [this](http://www.dotnetperls.com/levenshtein)? Maybe it is faster – I4V Apr 16 '13 at 11:48
  • 3
    If the CPU is running at 100% more parallelizing isn't going to help. You could try another algorithm, maybe Jaro-Winkler? – cgotberg Apr 16 '13 at 11:49
  • This question and its answers might give you some hints towards faster algorithms. http://stackoverflow.com/questions/4868969/implementing-a-simple-trie-for-efficient-levenshtein-distance-calculation-java – spender Apr 16 '13 at 12:01
  • 1
    What is the length of strings? (roughly - <10, <100, <1000, ...) – maxim1000 Apr 16 '13 at 13:11
  • As @cgotberg pointed out, a "Highly parallelizable Levenstein Distance Algorithm" won't help once the number of threads exceeds the number of CPU cores. Have you tried limiting the number of threads to the number of CPUs? – mbeckish Apr 17 '13 at 13:11
  • Do you process each string in separate thread? If yes - having a lot of threads isn't good at all for performance, better to run 2-8 threads. – alex.b Apr 17 '13 at 13:12

2 Answers2

5

What you are actually doing is using the Needleman-Wunsch (NW) algorithm right?
NW algorithm is based upon calculating DP matrix where each field depends on 3 neighbouring fields: to the left, topleft and top. Because of that it is ussualy solved row by row.
However if you would like to parallelize it then one of most common ideas is to solve DP matrix by antidiagonals. That way you can calculate each field in an antidiagonal independently.

This is the way how your function getLevensetinDistance is working now: you calculate row by row which means you have to calculate field by field and no parallelization is possible, as shown on picture:

enter image description here

You need to change your function getLevesteinDistance in order to be able to parallelize it. Here is a picture of antidiagonal idea that i described earlier, where each field in an antidiagonal can be calculated independently which means you can do paralellization (fields with same number can be calculated in parallel):

enter image description here

Could you explain how do you call your algorithm in parallel? Since you function getLevensteinDistance() accepts two strings I don't see any sense in calling it in parallel except if comparing multiple pairs of strings, but you already mentioned you call in parallel your function compare() for that.

By the way, it should be Levenshtein, not Levenstein :).

Also, I actually ended up implementing a C/C++ library for Levenshtein distance that is among the fastest implementations for longer strings, you can check it out here: https://github.com/Martinsos/edlib - maybe that is a better option than implementing your own, although it works on only one thread (but you could run it on multiple threads by yourself).

Martinsos
  • 1,663
  • 15
  • 31
  • What I do is : I have a list of string, foreach of these string I "start a thread" and within each of these threads I start a new one for each comparison of the current string to the other. – Emmanuel Istace Apr 17 '13 at 09:38
  • 1
    @EmmanuelIstace Ok, and how many threads do you have in average working? If you have many strings than I believe just one level of parallelization could do more good then two levels of parallelization, because if you for example have processor with 4 cores than best number of threads working at same time should be about 4. If you have for example 8 or 10 threads on 4-core processor you will actually experience decline in performance because processor spends to much time handling the threads, and majority of them is not actually working concurently – Martinsos Apr 17 '13 at 11:05
  • Yeah I agree but it's not "classical" thread. I let the TPL manage them and just call a Parallel.ForEach. – Emmanuel Istace Apr 17 '13 at 13:07
  • @EmmanuelIstace - Are you sure that the TPL is limiting the number of threads to match the number of cores? I believe the default might be 2 threads per core? – mbeckish Apr 17 '13 at 13:15
  • @EmmanuelIstace I agree with mbeckish, you should explore how does TPL work, how many threads does it create to run concurrently. It would help if you could show us some code where you call paralellization. Also I wouldn't call what you are doing a paralellization on two levels, it is actually a parallelization on one level: you call one thread for each pair of strings – Martinsos Apr 18 '13 at 10:34
0

I've found a supa dupa library called SimMetrics which contains a lot of implementation of similarity algorithm and when I benchmark them there are some great and usefull in my case much more faster.

If you're interested too : http://sourceforge.net/projects/simmetrics/

Emmanuel Istace
  • 1,209
  • 2
  • 14
  • 32