A preamble: a task like this will always be time-consuming, and some pairs will always slip through.
Nevertheless, a few ideas:
1. Actually, the algorithm can be improved (a bit)
Assuming that $serie1 and $serie2 hold the same values in the same order, you don't need to loop over the whole second array in the inner loop every time. In this use case you only need to evaluate each pair of values once: levenshtein('a', 'b') is sufficient, you don't need levenshtein('b', 'a') as well (and you don't need levenshtein('a', 'a') either).
Under these assumptions, you can write your function like this:
for ($x = 0; $x < count($serie1); $x++)
{
    for ($y = $x + 1; $y < count($serie2); $y++) // <-- $y doesn't need to start at 0
    {
        $sim = levenshtein($serie1[$x]['naam'], $serie2[$y]['naam']);
        if ($sim == 1)
            print("{$serie1[$x]['naam']} --> {$serie2[$y]['naam']} = {$sim}<br>");
    }
}
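For illustration, the same triangular loop can be wrapped in a reusable function that collects the pairs instead of printing them. This is only a sketch: the function name findNearDuplicates and the $maxDistance parameter are made up for this example, and it assumes a single array of rows with a 'naam' key, which is equivalent to the "both arrays are identical" assumption above.

```php
<?php
// Hypothetical helper (not from the question): returns every unordered
// pair of names whose Levenshtein distance is between 1 and $maxDistance.
function findNearDuplicates(array $rows, int $maxDistance = 1): array
{
    $pairs = [];
    $n = count($rows); // count once instead of on every iteration
    for ($x = 0; $x < $n; $x++) {
        // start at $x + 1: skips (a, a) and the mirrored pair (b, a)
        for ($y = $x + 1; $y < $n; $y++) {
            $sim = levenshtein($rows[$x]['naam'], $rows[$y]['naam']);
            if ($sim >= 1 && $sim <= $maxDistance) {
                $pairs[] = [$rows[$x]['naam'], $rows[$y]['naam'], $sim];
            }
        }
    }
    return $pairs;
}

// Sample data invented for the example.
$rows = [['naam' => 'Jansen'], ['naam' => 'Janssen'], ['naam' => 'Pieters']];
foreach (findNearDuplicates($rows) as [$a, $b, $sim]) {
    print("{$a} --> {$b} = {$sim}<br>");
}
```

This roughly halves the number of levenshtein() calls compared to the full double loop, but the work still grows quadratically with the number of rows.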
2. Maybe MySQL is faster
There are examples on the net of levenshtein() implemented as a MySQL function. An example on SO is here: How to add levenshtein function in mysql?
If you are comfortable with complex(ish) SQL, you could delegate the heavy lifting to MySQL and at least gain a bit of performance, because you aren't fetching all 16k rows into the PHP runtime.
3. Don't do everything at once / save your results
Of course you have to run the function once for every record, but after the initial run you only have to check the entries that are new since the last run. Schedule a cron job that checks all new records once every day/week/month. You would need an inserted_at column in your table, and you would still need to compare the new names with every other name entry.
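A minimal sketch of such an incremental check, assuming the new names (selected via inserted_at) and the full list of names have already been fetched into plain arrays of strings. The function name findNearDuplicatesForNew and its parameters are hypothetical, and the database query itself is left out.

```php
<?php
// Hypothetical helper: compares only the names added since the last run
// against all names, instead of comparing everything with everything.
function findNearDuplicatesForNew(array $newNames, array $allNames, int $maxDistance = 1): array
{
    $pairs = [];
    foreach ($newNames as $new) {
        foreach ($allNames as $existing) {
            $sim = levenshtein($new, $existing);
            if ($sim < 1 || $sim > $maxDistance) {
                continue; // identical (includes the name itself) or too different
            }
            // Two new names that match each other would be found twice,
            // so use an order-independent key to report each pair once.
            $pair = [$new, $existing];
            sort($pair);
            $key = implode("\n", $pair);
            if (!isset($pairs[$key])) {
                $pairs[$key] = [$new, $existing, $sim];
            }
        }
    }
    return array_values($pairs);
}

// Sample data invented for the example: one new name, three names in total.
print_r(findNearDuplicatesForNew(['Janssen'], ['Jansen', 'Pieters', 'Janssen']));
```

With k new records and n records in total this needs about k * n comparisons per run rather than n * n / 2.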
3.5 Do some of the work on insert
a) If the wait is acceptable, run the check whenever a new record is about to be inserted, so that you can either write the result to a log or give direct feedback to the user. (A tangent: this could be a good use case for an asynchronous task queue like http://gearman.org/ -> start a new process for the check in the background and return the success message for the insert immediately.)
b) PHP has two other functions that help with finding similar strings: metaphone() and soundex(). These functions generate abstract hashes that represent roughly how a string sounds when spoken. You could generate one or both of these hashes on each insert, store them in separate fields in your table, and use simple SQL comparisons to find records with matching hashes.
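To give an idea of how the phonetic hashes behave, here is a small sketch that groups names by their soundex() value in plain PHP. The function name groupByPhoneticKey and the sample names are invented for this example; in the setup described above the hash would be stored in a column and the grouping done in SQL instead.

```php
<?php
// Hypothetical helper: groups names that share the same soundex() hash
// and returns only the groups containing more than one name.
function groupByPhoneticKey(array $names): array
{
    $groups = [];
    foreach ($names as $name) {
        $groups[soundex($name)][] = $name;
    }
    // Drop hashes that belong to a single name only (keys are preserved).
    return array_filter($groups, function (array $group) {
        return count($group) > 1;
    });
}

// 'Robert' and 'Rupert' share a soundex hash, 'Pieters' does not.
print_r(groupByPhoneticKey(['Robert', 'Rupert', 'Pieters']));
```

Note that this only finds names that sound alike; a typo that changes the sound of a name can still need the levenshtein() check.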