php : speed up levensthein comparing, 10k + records

Question

In my MySQL table I have the field name, which is unique. However the contents of the field are gathered on different places. So it is possible I have 2 records with a very similar name instead of second one being discarded, due to spelling errors.

Now I want to find those entries that are very similar to another one. For that I loop through all my records, and compare the name to other entries by looping through all the records again. Problem is that there are over 15k records which takes way too much time. Is there a way to do this faster?

this is my code:

for($x=0;$x<count($serie1);$x++)
    {
    for($y=0;$y<count($serie2);$y++)
        {
        $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
        if($sim==1)
            print("{$A[$x]['naam']} --> {$B[$y]['naam']} = {$sim}<br>");
        }
     }
 }

some improvement might be felt if you would avoid counting the elements on each iteration and you would store the number of elements in a variable instead — mishu, May 06 '15 at 15:02

score 2 · Answer 1 · edited May 23 '17 at 12:21

A preamble: such a task will always be time consuming, and there will always be some pairs that slip through. Nevertheless, a few ideas :

1. actually, the algorithm can be (a bit) improved

assuming that $series1 and $series2 have the same values in the same order, you don't need to loop over the whole second array in the inner loop every time. In this use case you only need to evaluate each value pair once - levenshtein('a', 'b') is sufficient, you don't need levenshtein('b', 'a') as well (and neither do you need levenstein('a', 'a'))

under these assumptions, you can write your function like this:

for($x=0;$x<count($serie1);$x++)
{
   for($y=$x+1;$y<count($serie2);$y++) // <-- $y doesn't need to start at 0
    {
      $sim=levenshtein($serie1[$x]['naam'],$serie2[$y]['naam']);
      if($sim==1)
        print("{$A[$x]['naam']} --> {$B[$y]['naam']} = {$sim}<br>");
    }
 }

2. maybe MySQL is faster

there examples in the net for levenshtein() implementations as a MySQL function. An example on SO is here: How to add levenshtein function in mysql?

If you are comfortable with complex(ish) SQL, you could delegate the heavy lifting to MySQL and at least gain a bit of performance because you aren't fetching the whole 16k rows into the PHP runtime.

3. don't do everything at once / save your results

of course you have to run the function once for every record, but after the initial run, you only have to check new entries since the last run. Schedule a chronjob that once every day/week/month.. checks all new records. You would need an inserted_at column in your table and would still need to compare the new names with every other name entry.

3.5 do some of the work onInsert

a) if the wait is acceptable, do a check once a new record should be inserted, so that you either write it to a log oder give a direct feedback to the user. (A tangent: this could be a good use case for an asynchrony task queue like http://gearman.org/ -> start a new process for the check in the background, return with the success message for the insert immediately)

b) PHP has two other function to help with searching for almost similar strings: metaphone() and soundex() . These functions generate abstract hashes that represent how a string will sound when spoken. You could generate (one or both of) these hashes on each insert, store them as a separate field in your table and use simple SQL functions to find records with similar hashes

score 0 · Answer 2 · edited May 23 '17 at 12:30

The trouble with levenshtein is it only compares string a to string b. I built a spelling corrector once that puts all the strings a into a big trie, and that functioned as a dictionary. Then it would look up any string b in that dictionary, finding all nearest-matching words. I did it first in Fortran (!), then in Pascal. It would be easiest in a more modern language, but I suspect php would not make it easy. Look here.

php : speed up levensthein comparing, 10k + records

2 Answers2

1. actually, the algorithm can be (a bit) improved

2. maybe MySQL is faster

3. don't do everything at once / save your results

3.5 do some of the work onInsert