Similarities in Strings for name matching

Question

I am trying to decide if two different restaurant names are similar to be able to match them. The names might be misspelled or the parts of the title might be in the wrong order.

In some cases it is a simple as matching: "Angry Diner" with "Angry Diner Restaurant". or "Burger King" with "Burgor King"

A harder case that I have found is: "Mathias Dahlgren Matbaren" and "Restaurant Mathias Dahlgren"

I have looked at several different fuzzy string difference algorithms but have not found one for this use case.

Anyone who knows about an algorithm and/or library I can use?

Depending on what exactly you want to do, your question can be seen as a duplicate of the following question http://stackoverflow.com/questions/29321760/how-to-check-a-partial-similarity-of-two-strings-in-c-sharp/29322466?noredirect=1#comment46836018_29322466 — Codor, Apr 06 '15 at 18:02
I have looked at and tried Levenshtein distance, but it does not work well when words have been shuffled around. — Heinrisch, Apr 06 '15 at 18:04
Do you mean that, for example, the distance between "Burgor King" and "King Burger" should be smaller than the Levenshtein distance? — Codor, Apr 06 '15 at 18:06
Can you remove words like `{ "restaurant", "place", "palace", ... }` from both words before applying the fuzzy string difference algorithm? — higuaro, Apr 06 '15 at 18:08
@Codor for that example Levenshtein distance is probably the best. However, it does not perform well at all on the harder one. — Heinrisch, Apr 07 '15 at 07:34
@higuaro That is what I ended up doing and it works better than any other solution that I have found. Thanks! — Heinrisch, Apr 07 '15 at 07:35
Furthermore, you could normalize the names internally by splitting them up into individual words, sort them and match similarity afterwards; this way, "Burger King" and "King Burger" will have distance zero. — Codor, Apr 07 '15 at 10:35

score 3 · Answer 1 · answered Apr 15 '15 at 16:12

First of all: you will get much better results if you have more than just the name to match on, such as the address. Then you can use a record linkage engine to consider the evidence from all the properties. Using just the name is going to give poor precision in most cases.

The first thing you need to consider is if you are likely to see reordering of substrings. That is, "Restaurant Angry Diner" vs "Angry Diner Restaurant". In that case, q-grams, longest common substring and longest common subsequence are all good candidates. For q-grams you can choose between various sub-formulas and matchings.

If you want the order to matter, affine gap would probably be good for this particular. It's similar to Smith-Waterman but doesn't penalize so much for deletions. Basically, the first deletion is expensive, but then later deletions in the same place are less expensive.

As others have suggested, removing common words like "restaurant", "matbaren" etc before matching is likely to improve accuracy.

There are piles of libraries, but since you didn't specify the programming language it's hard to recommend one. What use is Java if you use PHP? Or vice versa?

But please note carefully what I wrote above: name alone is not going to work very well. Even if the names are identical it could still be two totally different restaurants.

score 0 · Answer 2 · edited May 23 '17 at 12:06

You can try the diff algorithm. It creates all possible strings and find the longest common subsequence.

Well, as mentioned above the speed is O(N^3), i've done a longest common subsequence way that is O(m.n) where m and n are the length of str1 and str2, the result is a percentage and it seems to be exactly the same as similar_text percentage but with better performance... here's the 3 functions i'm using.. 

<?php 
function LCS_Length($s1, $s2) 
{ 
  $m = strlen($s1); 
  $n = strlen($s2); 

  //this table will be used to compute the LCS-Length, only 128 chars per string are considered
  $LCS_Length_Table = array(array(128),array(128)); 


  //reset the 2 cols in the table 
  for($i=1; $i < $m; $i++) $LCS_Length_Table[$i][0]=0; 
  for($j=0; $j < $n; $j++) $LCS_Length_Table[0][$j]=0; 

  for ($i=1; $i <= $m; $i++) { 
    for ($j=1; $j <= $n; $j++) { 
      if ($s1[$i-1]==$s2[$j-1]) 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j-1] + 1; 
      else if ($LCS_Length_Table[$i-1][$j] >= $LCS_Length_Table[$i][$j-1]) 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j]; 
      else 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i][$j-1]; 
    } 
  } 
  return $LCS_Length_Table[$m][$n]; 
} 

function str_lcsfix($s) 
{ 
  $s = str_replace(" ","",$s); 
  $s = ereg_replace("[��������]","e", $s); 
  $s = ereg_replace("[������������]","a", $s); 
  $s = ereg_replace("[��������]","i", $s); 
  $s = ereg_replace("[���������]","o", $s); 
  $s = ereg_replace("[��������]","u", $s); 
  $s = ereg_replace("[�]","c", $s); 
  return $s; 
} 

function get_lcs($s1, $s2) 
{ 
  //ok, now replace all spaces with nothing 
  $s1 = strtolower(str_lcsfix($s1)); 
  $s2 = strtolower(str_lcsfix($s2)); 

  $lcs = LCS_Length($s1,$s2); //longest common sub sequence 

  $ms = (strlen($s1) + strlen($s2)) / 2; 

  return (($lcs*100)/$ms); 
} 
?> 

you can skip calling str_lcsfix if you don't worry about accentuated characters and things like that or you can add up to it or modify it for faster performance, i think ereg is not the fastest way? 
hope this helps. 
Georges

[1] http://php.net/manual/de/function.similar-text.php

[2] String similarity -> Levenshtein distance

score 0 · Answer 3 · answered Apr 08 '15 at 04:58

I think that the best fitting algorithm would be the algorithm for optimal local alignment: Smith-Waterman-Algorithm

penalty("Angry Diner","Angry Diner Restaurant") = 0
penalty("Burger King", "Burgor King") = 1
penalty("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren") = 0

It is a variant of the Levensthein-Algorithm with the difference that insertions/deletions of characters at the beginning/end are not penalized.

Similarities in Strings for name matching

3 Answers3