3

I am trying to decide if two different restaurant names are similar to be able to match them. The names might be misspelled or the parts of the title might be in the wrong order.

In some cases it is a simple as matching: "Angry Diner" with "Angry Diner Restaurant". or "Burger King" with "Burgor King"

A harder case that I have found is: "Mathias Dahlgren Matbaren" and "Restaurant Mathias Dahlgren"

I have looked at several different fuzzy string difference algorithms but have not found one for this use case.

Anyone who knows about an algorithm and/or library I can use?

Heinrisch
  • 5,835
  • 4
  • 33
  • 43
  • Depending on what exactly you want to do, your question can be seen as a duplicate of the following question http://stackoverflow.com/questions/29321760/how-to-check-a-partial-similarity-of-two-strings-in-c-sharp/29322466?noredirect=1#comment46836018_29322466 – Codor Apr 06 '15 at 18:02
  • I have looked at and tried Levenshtein distance, but it does not work well when words have been shuffled around. – Heinrisch Apr 06 '15 at 18:04
  • Do you mean that, for example, the distance between "Burgor King" and "King Burger" should be smaller than the Levenshtein distance? – Codor Apr 06 '15 at 18:06
  • Can you remove words like `{ "restaurant", "place", "palace", ... }` from both words before applying the fuzzy string difference algorithm? – higuaro Apr 06 '15 at 18:08
  • @Codor for that example Levenshtein distance is probably the best. However, it does not perform well at all on the harder one. – Heinrisch Apr 07 '15 at 07:34
  • @higuaro That is what I ended up doing and it works better than any other solution that I have found. Thanks! – Heinrisch Apr 07 '15 at 07:35
  • Furthermore, you could normalize the names internally by splitting them up into individual words, sort them and match similarity afterwards; this way, "Burger King" and "King Burger" will have distance zero. – Codor Apr 07 '15 at 10:35

3 Answers3

3

First of all: you will get much better results if you have more than just the name to match on, such as the address. Then you can use a record linkage engine to consider the evidence from all the properties. Using just the name is going to give poor precision in most cases.

The first thing you need to consider is if you are likely to see reordering of substrings. That is, "Restaurant Angry Diner" vs "Angry Diner Restaurant". In that case, q-grams, longest common substring and longest common subsequence are all good candidates. For q-grams you can choose between various sub-formulas and matchings.

If you want the order to matter, affine gap would probably be good for this particular. It's similar to Smith-Waterman but doesn't penalize so much for deletions. Basically, the first deletion is expensive, but then later deletions in the same place are less expensive.

As others have suggested, removing common words like "restaurant", "matbaren" etc before matching is likely to improve accuracy.

There are piles of libraries, but since you didn't specify the programming language it's hard to recommend one. What use is Java if you use PHP? Or vice versa?

But please note carefully what I wrote above: name alone is not going to work very well. Even if the names are identical it could still be two totally different restaurants.

larsga
  • 654
  • 6
  • 10
0

You can try the diff algorithm. It creates all possible strings and find the longest common subsequence.

Well, as mentioned above the speed is O(N^3), i've done a longest common subsequence way that is O(m.n) where m and n are the length of str1 and str2, the result is a percentage and it seems to be exactly the same as similar_text percentage but with better performance... here's the 3 functions i'm using.. 

<?php 
function LCS_Length($s1, $s2) 
{ 
  $m = strlen($s1); 
  $n = strlen($s2); 

  //this table will be used to compute the LCS-Length, only 128 chars per string are considered
  $LCS_Length_Table = array(array(128),array(128)); 


  //reset the 2 cols in the table 
  for($i=1; $i < $m; $i++) $LCS_Length_Table[$i][0]=0; 
  for($j=0; $j < $n; $j++) $LCS_Length_Table[0][$j]=0; 

  for ($i=1; $i <= $m; $i++) { 
    for ($j=1; $j <= $n; $j++) { 
      if ($s1[$i-1]==$s2[$j-1]) 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j-1] + 1; 
      else if ($LCS_Length_Table[$i-1][$j] >= $LCS_Length_Table[$i][$j-1]) 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i-1][$j]; 
      else 
        $LCS_Length_Table[$i][$j] = $LCS_Length_Table[$i][$j-1]; 
    } 
  } 
  return $LCS_Length_Table[$m][$n]; 
} 

function str_lcsfix($s) 
{ 
  $s = str_replace(" ","",$s); 
  $s = ereg_replace("[��������]","e", $s); 
  $s = ereg_replace("[������������]","a", $s); 
  $s = ereg_replace("[��������]","i", $s); 
  $s = ereg_replace("[���������]","o", $s); 
  $s = ereg_replace("[��������]","u", $s); 
  $s = ereg_replace("[�]","c", $s); 
  return $s; 
} 

function get_lcs($s1, $s2) 
{ 
  //ok, now replace all spaces with nothing 
  $s1 = strtolower(str_lcsfix($s1)); 
  $s2 = strtolower(str_lcsfix($s2)); 

  $lcs = LCS_Length($s1,$s2); //longest common sub sequence 

  $ms = (strlen($s1) + strlen($s2)) / 2; 

  return (($lcs*100)/$ms); 
} 
?> 

you can skip calling str_lcsfix if you don't worry about accentuated characters and things like that or you can add up to it or modify it for faster performance, i think ereg is not the fastest way? 
hope this helps. 
Georges

[1] http://php.net/manual/de/function.similar-text.php

[2] String similarity -> Levenshtein distance

Community
  • 1
  • 1
Micromega
  • 12,486
  • 7
  • 35
  • 72
0

I think that the best fitting algorithm would be the algorithm for optimal local alignment: Smith-Waterman-Algorithm

penalty("Angry Diner","Angry Diner Restaurant") = 0
penalty("Burger King", "Burgor King") = 1
penalty("Mathias Dahlgren Matbaren", "Restaurant Mathias Dahlgren") = 0

It is a variant of the Levensthein-Algorithm with the difference that insertions/deletions of characters at the beginning/end are not penalized.

CoronA
  • 7,717
  • 2
  • 26
  • 53