Compare 5000 strings with PHP Levenshtein

Question

I have 5000, sometimes more, street address strings in an array. I'd like to compare them all with levenshtein to find similar matches. How can I do this without looping through all 5000 and comparing them directly with every other 4999?

Edit: I am also interested in alternate methods if anyone has suggestions. The overall goal is to find similar entries (and eliminate duplicates) based on user-submitted street addresses.

Regarding your update, you might need to apply some input cleaning to make your life easier. (e.g.: If you convert 'Ave' to 'Avenue' 'Rd' to 'Road', etc. prior to storage using soundex would become a more realistic option.) — John Parker, Dec 24 '09 at 11:49
How do you define similar addresses? Do you have some max value for Lehvenstein distance that is limit for similarity, etc? — Juha Syrjälä, Dec 24 '09 at 11:50
How about "923 Bird Road, Apt 6" and "23 Bird Road, Apt 6"? Are they similar? Lehvenstein distance would say that they are, after all there is only one character difference, however geographically one could consider them not to be very similar. — Juha Syrjälä, Dec 24 '09 at 11:56
@Juha: That is a very good point. Plus my addresses are grouped within a hundred mile radius. So using raw Levenshtein may be worthless. — phirschybar, Dec 24 '09 at 11:59
And 7==levenshtein("12 Bird Road, Apt 6" "12 Bird Rd. #6"). Given that the shorter string is 14 characters long that's not very similar ;-) — VolkerK, Dec 24 '09 at 11:59
@VolkerK: Also true. Ok then. String comparison (while cool) is not the smart way to do it. Any suggestions or know of any libraries (php) which help to parse out (diagram) street addresses into parts perhaps?? — phirschybar, Dec 24 '09 at 12:01
@phirschybar: Could you give a larger set of examples of different kinds of addressess and their variations? — Juha Syrjälä, Dec 24 '09 at 12:10
@Juha: Yes. In fact, grouping the similar results is the first step and eliminating them manually is the second. My client is looking to have that manual control in the second step. So, just being able to give them groups of the closely matching addresses will go a long way. — phirschybar, Dec 24 '09 at 12:59
@phirschybar: Have you actually implemented the algorithm where you compare all addresses with Lehvenstein distance? How long it does actually take? — Juha Syrjälä, Dec 24 '09 at 13:54
@Juha: I have not yet implemented it. Still strategizing on the best approach. — phirschybar, Dec 24 '09 at 14:51
Can you limit the addresses to a certain area in the world, i.e. are there only _x_ "common" formats? Google provides a geocoding service where you send a (more or less) free-form address-string and receive structured information about the address ( http://code.google.com/apis/maps/documentation/services.html#Geocoding ). But for thousands of addresses .... I don't think so, unless you pay them :-) — VolkerK, Dec 25 '09 at 10:37

score 7 · Accepted Answer · edited Jun 18 '11 at 14:21

I think a better way to group similar addresses would be to:

create a database with two tables - one for the address (and a id), one for the soundexes of words or literal numbers in the address (with the foreign key of the addresses table)
uppercase the address, replace anything other than [A-Z] or [0-9] with a space
split the address by space, calculate the soundex of each 'word', leave anything with just digits as is and store it in the soundexes table with the foreign key of the address you started with

for each address (with id $target) find the most similar addresses:

SELECT similar.id, similar.address, count(*) 
FROM adress similar, word cmp, word src
WHERE src.address_id=$target
AND src.soundex=cmp.soundex
AND cmp.address_id=similar.id
ORDER BY count(*)
LIMIT $some_value;

calculate the levenstein difference between your source address and the top few values returned by the query.

(doing any sort of operation on large arrays is often faster in databases)

I would use normalized addresses in above algoritm. That is, addresses where abbreviations have been removed etc. ( "Ave." changed to "Avenue" etc) — Juha Syrjälä, Dec 24 '09 at 12:13

score 3 · Answer 2 · answered Dec 24 '09 at 11:41

3

I think you cannot avoid looping through the array as the levenstein() function takes only strings and not an array as input.

You can do something like:

for($i=0;$i<count($array)-1;$i++)
{
    for($j=$i+1;$j<count($array);$j++)
    {
        $lev = levenshtein($array[$i],$array[$j]);
        if($lev == 0)
        {
            // exact match
        }
        else if($lev <= THRESHOLD)
        {
            // similar
        }
    }
}

answered Dec 24 '09 at 11:41

codaddict

445,704
82
492
529

1

I am afraid of this as it would do 25 million comparisons. – phirschybar Dec 24 '09 at 11:47
1

@phirschybar: Actually it will be 12.5 Million, we are comparing only distinct pairs, if pair(X,Y) is compared, we skip comparing pair(Y,X). – codaddict Dec 24 '09 at 11:51

VolkerK · Answer 3 · 2009-12-24T12:03:35.793

You can use a bk-tree to speed-up the search/comparison.

http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees says:

Now we can make a particularly useful observation about the Levenshtein Distance: It forms a Metric Space.
[...]
Assume for a moment we have two parameters, query, the string we are using in our search, and n the maximum distance a string can be from query and still be returned. Say we take an arbitary string, test and compare it to query. Call the resultant distance d. Because we know the triangle inequality holds, all our results must have at most distance d+n and at least distance d-n from test.
[...]
Tests show that searching with a distance of 1 queries no more than 5-8% of the tree, and searching with two errors queries no more than 17-25% of the tree - a substantial improvement over checking every node!

edit: But that doesn't help you with your ("12 Bird Road, Apt 6" and "12 Bird Rd. #6") problem. Only with your brute-force m*n comparison.

score 2 · Answer 4 · answered Dec 24 '09 at 11:42

Due to the nature of the Levenshtein algorithm (specifically the fact that it's a comparision between two strings), I can't see how this is possible.

You could of course reduce the number of comparisons by performing some basic matching requirements first, but this is out of the scope of what you're asking.

As a (quite possibly irrelevant) option, you could always use something like soundex which would let you pre-compute the string values. (You can also use it directly in MySQL I believe.)

score 2 · Answer 5 · answered Dec 24 '09 at 11:42

2

You could group them based on soundexes then limit the comparisons to the nearest N cases...

 $mashed=array();
 foreach ($address as $key=>$val) {
      $mashed[$key]=soundex($val);
 }
 sort($mashed);

Then iterate through the keys of $mashed.

C.

answered Dec 24 '09 at 11:42

symcbean

47,736
6
59
94

I have never worked with soundexes. Any idea how well they work with street address type abbreviations like "St." ? – phirschybar Dec 24 '09 at 11:44
Soundex ( http://en.wikipedia.org/wiki/Soundex ) is designed to work with people names. If you apply them to address, it makes sense to apply soundex algorithm to each part of address e.g "23 Bird Road" -> SOUNDEX("Bird") and SOUNDEX("Road") – Juha Syrjälä Dec 24 '09 at 12:07
This solution would be something like `O(2n)` or `O(2nm)` (both unsimplified), where `m` is each word in the address. This is a great improvement over `O(n^2)`. – Justin Johnson Dec 24 '09 at 21:56
It can't interpolate abbreviations - but St could be Street or Saint or...? depending on context. C. – symcbean Dec 28 '09 at 01:22

Juha Syrjälä · Answer 6 · 2009-12-24T12:03:03.080

Given you problem, I see no other way than to compare every address with every other address if you want to use Lehvenstein distance.

First of all, you should normalize addressess, get rid of abbreviations etc.

Ave -> Avenue
Rd. -> Road

You could have some fixed max Lehvenstein distance ( N ) for similar addresses.

If so, you could you could abort the Lehvenstein algorithm when you are sure that the edit distance for current address pair is larger than N. For this you need to write a custom version of Lehvenstein algorithm. This will make the algorithm a little quicker.

There are also some related trivial optimizations. For example: if address A is 10 chars long and address B is 20 chars long and you consider addresses that have Lehvenstein distance of less than 8 to be similar. You can look lengths of addresses and immediately decide that they are not similar.

score 1 · Answer 7 · answered Dec 24 '09 at 11:48

If you want to find all similar values, you will have to compare all items to all others. But choosing the right array functions will speed things up significantly. Here is a quick example (the results array could have been better):

$results = array();
$count = count($entries);
while ($count != 0) {
    # The entry to process
    $entry = array_shift($entries);
    # Get levenshtein distances to all others
    $result = array_map(
        'levenshtein',
        # array_map() needs two arrays, this one is an array consisting of
        # multiple entries of the value that we are processing
        array_fill($entry, 0, $count),
        $toCompare
    );
    $results[] = array($entry => $result);
    $count--;
}

score 1 · Answer 8 · edited Jun 18 '11 at 14:46

$stringA = "this is php programming language";
$stringB = "this is complete programming script in which java php and  all other minor languages include";

echo "string 1---->".$stringA."<br />";
echo "string 2---->".$stringB."<br />";
// changing string to arrays
$array1 = explode(' ', $stringA);
$array2 = explode(' ', $stringB);

// getting same element from two strings
$c = array_intersect($array1, $array2);
// changing array to the string
$d=implode(' ',$c);

echo "string same elements---> ".$d."<br />";


// getting difrent  element from two arrays
$result = array_diff($array2, $array1);
// changing array to the string
$zem = implode(' ',$result);

if (!empty($zem)) {
  echo "string diffrence---> ".$zem."<br />";
} else {
  echo "string diffrence--->both strings are same <br />";
}

similar_text($stringA, $d, $p);
echo " similarity between the string is ".$p."% <br />";

Compare 5000 strings with PHP Levenshtein

8 Answers8

Linked