Determine if two names are close to each other

Question

I'm making a system for my school where we can check whether a student is black-listed at parties and other events. Checking a student is easy, since I can just look the student up in my database and see whether he/she is black-listed.

Here is where it gets difficult though.

At our parties, each student can invite one person. In theory, a black-listed student can be invited by another student and bypass the system. I cannot check the guest table for black-listed students, because only a name is provided when you invite your guest.

So I need to check whether a black-listed name is close to a guest name, and display a warning if they are close. Unfortunately, there are a few things to take into account.

Names can be quite different. In Denmark, the standard name contains three "names", like "Niels Faurskov Andersen". But a student may just type "Niels Faurskov" or "Niels Andersen", or even leave some characters out.

So a full name such as Niels Faurskov Andersen could be entered as

Niels Andersen
Niels Faurskov
Niels Faurskov Andersen
Nils Faurskov Andersen
Nils Andersen
niels faurskov
niels Faurskov

And so on...

Another thing is that the Danish alphabet contains "æøå" apart from the usual a-z. With that said the whole site and database is UTF-8 encoded.

I've looked into various methods to check the difference between two strings, and the Levenshtein distance doesn't quite do it.

I found this thread on StackOverflow: Getting the closest string match

That seemed to provide the right data; however, I wasn't quite sure which method to choose.

I'm coding this part in PHP. Does anybody have an idea how to do this? Maybe with MySQL, or a modified version of the Levenshtein distance? Could regex be an option?

"check if two a student is under quarantine" eh? I think 'quarantine' is 'black-listed' in this context, right? — Strawberry, Jan 27 '14 at 11:24
My instinct is that Levenshtein isn't going to help you - but a REGULAR EXPRESSION might. I wouldn't be surprised if there were some 'standard' Danish name validation expressions out there. — Strawberry, Jan 27 '14 at 11:29
You *might* be able to use MySQL's `SOUNDEX` function ( http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex ) BUT it's not (apparently) entirely reliable in languages other than English or in utf-8 ... so Danish names may be problematic - might be worth taking a look at though. — CD001, Jan 27 '14 at 11:31
Unfortunately, for me REGEX is something of a 'dark art', but others may be able to help. SOUNDEX will, I suspect, be useless for this - but happy to be proven wrong. — Strawberry, Jan 27 '14 at 11:33
@CD001 I'll have to look into that :) Haha, I hope someone can ^^ — Jazerix, Jan 27 '14 at 11:37

score 13 · Accepted Answer · edited Jun 20 '20 at 09:12

Introduction

Right now your matching conditions may be too broad. You can still use the Levenshtein distance to compare words, but it will not easily cover every goal on its own, such as sound similarity. So I suggest splitting your issue into smaller ones.

For example, you can create a custom checker that accepts a callable. The callable takes two strings and answers whether they are the same (for levenshtein() that means a distance below some value, for similar_text() a similarity percentage above some value, etc.; it's up to you to define the rules).
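As a rough illustration of that idea, here is a minimal sketch of two such callables. The function names and thresholds are made up for this example. These return a plain yes/no, unlike the class further down, which compares a numeric callback result against its own distance setting.

```php
<?php
// Hypothetical helpers (names invented for this sketch): each returns a
// callable that takes two strings and answers "are these similar?".

// Levenshtein: similar when the edit distance is at most $maxDistance.
function levenshteinChecker(int $maxDistance): callable
{
    return function (string $a, string $b) use ($maxDistance): bool {
        return levenshtein($a, $b) <= $maxDistance;
    };
}

// similar_text: similar when the similarity percentage is at least $minPercent.
function similarTextChecker(float $minPercent): callable
{
    return function (string $a, string $b) use ($minPercent): bool {
        similar_text($a, $b, $percent); // $percent is filled in by reference
        return $percent >= $minPercent;
    };
}

$byDistance = levenshteinChecker(2);
$byPercent  = similarTextChecker(80.0);

var_dump($byDistance('niels andersen', 'nils andersen'));
var_dump($byPercent('niels andersen', 'nils andersen'));
```

Which threshold is sensible depends on your data, so treat the values 2 and 80.0 as placeholders to tune.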

Similarity, based on words

All of the built-in functions will fail when you're looking for a partial match, especially a non-ordered one. So you'll need a more complex comparison tool. You have:

Data string (that will be in DB, for example). It looks like D = D₀ D₁ D₂ ... D_n
Search string (that will be user input). It looks like S = S₀ S₁ ... S_m

Here the spaces stand for any whitespace (I assume whitespace does not affect similarity). Also assume n ≥ m, where n and m are the word counts of D and S. With this definition, your issue is to find a set of m words in D that is similar to S. By set I mean any unordered selection. Hence, if we find any such selection in D, then S is similar to D.

If n < m, the input contains more words than the data string. In this case you may either treat them as not similar or proceed as above with data and input switched (that looks a little odd, but is applicable in some sense).
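The "switch data and input" option could be sketched like this. The helper name orderByWordCount is hypothetical and not part of the class shown later; it only decides which string plays the role of D.

```php
<?php
// Hypothetical helper: returns [data, search] ordered so that the first
// string never has fewer words than the second.
function orderByWordCount(string $data, string $search): array
{
    $countWords = function (string $s): int {
        return count(preg_split('/\s+/', $s, -1, PREG_SPLIT_NO_EMPTY));
    };

    return $countWords($data) < $countWords($search)
        ? [$search, $data]   // input is longer: swap the roles
        : [$data, $search];  // already in the expected order
}

list($d, $s) = orderByWordCount('Niels Andersen', 'Niels Faurskov Andersen');
echo $d, PHP_EOL; // the longer name is now treated as the data string
```

Whether swapping is appropriate is a policy decision: it makes a guest name with extra words match a shorter black-listed name.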

Implementation

To do this, you'll need to be able to build every selection of m words from D. Based on this question of mine, you can do it with:

protected function nextAssoc($assoc)
{
   if(false !== ($pos = strrpos($assoc, '01')))
   {
      $assoc[$pos]   = '1';
      $assoc[$pos+1] = '0';
      return substr($assoc, 0, $pos+2).
             str_repeat('0', substr_count(substr($assoc, $pos+2), '0')).
             str_repeat('1', substr_count(substr($assoc, $pos+2), '1'));
   }
   return false;
}

protected function getAssoc(array $data, $count=2)
{
   if(count($data)<$count)
   {
      // fewer words than requested: no selections, so callers can still iterate safely
      return [];
   }
   $assoc   = str_repeat('0', count($data)-$count).str_repeat('1', $count);
   $result = [];
   do
   {
      $result[]=array_intersect_key($data, array_filter(str_split($assoc)));
   }
   while($assoc=$this->nextAssoc($assoc));
   return $result;
}

So for any array, getAssoc() will return an array of unordered selections of m items each.

The next step is the order within each selection. We should search for both Niels Andersen and Andersen Niels in our D string. Therefore, you'll need to be able to create the permutations of an array. It's a very common task, but I'll put my version here too:

protected function getPermutations(array $input)
{
   if(count($input)==1)
   {
      return [$input];
   }
   $result = [];
   foreach($input as $key=>$element)
   {
      foreach($this->getPermutations(array_diff_key($input, [$key=>0])) as $subarray)
      {
         $result[] = array_merge([$element], $subarray);
      }
   }
   return $result;
}

After this you'll be able to create selections of m words and then, by permuting each of them, get all variants to compare with the search string S. Each comparison is done via some callback, such as levenshtein. Here's a sample:
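To make the "selections, then permutations" step concrete, here is a small standalone sketch that lists every candidate for m = 2. The functions combinations() and permutations() are simplified, hypothetical stand-ins written for this illustration; they are not the class methods above.

```php
<?php
// Illustrative standalone functions (hypothetical names).

// All ways to pick $m items from $items, keeping their original order.
function combinations(array $items, int $m): array
{
    if ($m === 0) {
        return [[]];
    }
    if (count($items) < $m) {
        return [];
    }
    $head   = array_shift($items);
    $result = [];
    foreach (combinations($items, $m - 1) as $rest) {
        $result[] = array_merge([$head], $rest);           // selections that include $head
    }
    return array_merge($result, combinations($items, $m)); // selections that skip $head
}

// All orderings of $items.
function permutations(array $items): array
{
    if (count($items) <= 1) {
        return [$items];
    }
    $result = [];
    foreach ($items as $i => $item) {
        $rest = $items;
        unset($rest[$i]);
        foreach (permutations(array_values($rest)) as $tail) {
            $result[] = array_merge([$item], $tail);
        }
    }
    return $result;
}

$words      = ['niels', 'faurskov', 'andersen'];
$candidates = [];
foreach (combinations($words, 2) as $selection) {
    foreach (permutations($selection) as $ordered) {
        $candidates[] = implode(' ', $ordered);
    }
}
print_r($candidates); // 3 selections x 2 orderings = 6 candidate strings
```

Each of these candidate strings is what gets handed to the similarity callback together with the search string.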

public function checkMatch($search, callable $checker=null, array $args=[], $return=false)
{
   $data   = preg_split('/\s+/', strtolower($this->data), -1, PREG_SPLIT_NO_EMPTY);
   $search = trim(preg_replace('/\s+/', ' ', strtolower($search)));
   foreach($this->getAssoc($data, substr_count($search, ' ')+1) as $assoc)
   {
       foreach($this->getPermutations($assoc) as $ordered)
       {
           $ordered = join(' ', $ordered);
           $result  = call_user_func_array($checker, array_merge([$ordered, $search], $args));
           if($result<=$this->distance)
           {
               return $return?$ordered:true;
           }
       }
   }
   
   return $return?null:false;
}

This checks similarity using a user callback, which must accept at least two parameters (the compared strings). Its return value is compared with the configured distance, so it should return a number where lower means closer. You may also ask for the string that triggered the positive result to be returned. Please note that this code does not distinguish upper and lower case; if you do not want that behavior, just remove or replace the strtolower() calls.

A sample of the full code is available in this listing (I didn't use a sandbox since I'm not sure how long a code listing stays available there). With this sample usage:
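Since the question involves Danish names stored as UTF-8, note that PHP's levenshtein() compares bytes, so a letter such as "ø" counts as two positions. One possible workaround is a callback that folds those letters to ASCII before comparing. The sketch below is only that: the function name is invented, the mapping (æ→ae, ø→oe, å→aa) is an assumed convention, and it needs the mbstring extension.

```php
<?php
// Hypothetical callback: fold Danish letters to ASCII digraphs, then compare.
// Assumes UTF-8 input and the mbstring extension.
function danishLevenshtein(string $a, string $b): int
{
    $fold = function (string $s): string {
        $s = mb_strtolower($s, 'UTF-8');
        // Assumed transliteration; adjust the map to your own conventions.
        return strtr($s, ['æ' => 'ae', 'ø' => 'oe', 'å' => 'aa']);
    };

    return levenshtein($fold($a), $fold($b));
}

var_dump(danishLevenshtein('Søren', 'Soeren'));
```

Because it takes two strings and returns a distance, it could be passed where 'levenshtein' is passed, for example `$checker->checkMatch($name, 'danishLevenshtein', [], true)`.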

$data   = 'Niels Faurskov Andersen';
$search = [
    'Niels Andersen',
    'Niels Faurskov',
    'Niels Faurskov Andersen',
    'Nils Faurskov Andersen',
    'Nils Andersen',
    'niels faurskov',
    'niels Faurskov',
    'niffddels Faurskovffre'//I've added this crap
];

$checker = new Similarity($data, 2);

echo(sprintf('Testing "%s"'.PHP_EOL.PHP_EOL, $data));
foreach($search as $name)
{
   echo(sprintf(
      'Name "%s" has %s'.PHP_EOL, 
      $name, 
      ($result=$checker->checkMatch($name, 'levenshtein', [], 1))
         ?sprintf('matched with "%s"', $result)
         :'mismatched'
      )
   );

}

you'll get a result like:

Testing "Niels Faurskov Andersen"

Name "Niels Andersen" has matched with "niels andersen"
Name "Niels Faurskov" has matched with "niels faurskov"
Name "Niels Faurskov Andersen" has matched with "niels faurskov andersen"
Name "Nils Faurskov Andersen" has matched with "niels faurskov andersen"
Name "Nils Andersen" has matched with "niels andersen"
Name "niels faurskov" has matched with "niels faurskov"
Name "niels Faurskov" has matched with "niels faurskov"
Name "niffddels Faurskovffre" has mismatched

Here is a demo for this code, just in case.

Complexity

Since you care not just about finding a method but also about how good it is, you may notice that this code performs quite a lot of operations, at the very least in generating the string parts. Complexity here consists of two parts:

String parts generation. If you want to generate all string parts, you'll have to do it as described above. A possible point of improvement is the generation of the unordered selections (which comes before permutation). But I doubt it can be improved, because the method in the provided code does not generate them by brute force; it computes each one directly, and their number is fixed at C(n, m) = n! / (m! (n − m)!).
Similarity checking. Here the complexity depends on the chosen similarity checker. For example, similar_text() has O(N³) complexity, so with large comparison sets it will be extremely slow.

But you can still improve the current solution by checking on the fly. Right now the code first generates all string sub-sequences and only then starts checking them one by one. In the common case you don't need that, so you may want to check each sequence immediately after generating it. That will improve performance for strings that have a match (but not for those that have none).
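One way to get this "check on the fly" behavior is with PHP generators, which produce each candidate only when the loop asks for it. This is a sketch under assumptions, not the class from the answer: the function names are hypothetical, and levenshtein() with a fixed maximum distance is hard-coded for brevity.

```php
<?php
// Sketch with generators: candidates are produced lazily, so the search
// stops at the first match. All function names are hypothetical.

function lazyCombinations(array $items, int $m): Generator
{
    if ($m === 0) {
        yield [];
        return;
    }
    $items = array_values($items);
    $n     = count($items);
    for ($i = 0; $i <= $n - $m; $i++) {
        // Pick $items[$i], then choose the remaining $m - 1 from what follows it.
        foreach (lazyCombinations(array_slice($items, $i + 1), $m - 1) as $rest) {
            yield array_merge([$items[$i]], $rest);
        }
    }
}

function lazyPermutations(array $items): Generator
{
    if (count($items) <= 1) {
        yield $items;
        return;
    }
    foreach ($items as $i => $item) {
        $rest = $items;
        unset($rest[$i]);
        foreach (lazyPermutations(array_values($rest)) as $tail) {
            yield array_merge([$item], $tail);
        }
    }
}

// Returns the first candidate within $maxDistance of $search, or null.
function firstMatch(string $data, string $search, int $maxDistance): ?string
{
    $words  = preg_split('/\s+/', strtolower($data), -1, PREG_SPLIT_NO_EMPTY);
    $search = trim(preg_replace('/\s+/', ' ', strtolower($search)));
    $m      = substr_count($search, ' ') + 1;

    foreach (lazyCombinations($words, $m) as $selection) {
        foreach (lazyPermutations($selection) as $ordered) {
            $candidate = implode(' ', $ordered);
            if (levenshtein($candidate, $search) <= $maxDistance) {
                return $candidate; // stop here: later candidates are never built
            }
        }
    }
    return null;
}

var_dump(firstMatch('Niels Faurskov Andersen', 'Nils Andersen', 2));
```

As noted above, this only helps when a match exists; a name with no match still forces every candidate to be generated and checked.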

Sounds really good, I looked into similiar_text earlier, it also looked like it could be used in situations like this. Thank you for your input. I'm not able to test it yet, but I will get back as soon as I am ^^ — Jazerix, Jan 27 '14 at 18:20
Reading through your answer, you say obviously a couple of times. I'm not sure what n and m is in this context, do you think you could elaborate it? :) — Jazerix, Jan 28 '14 at 09:30
`m` and `n` are count of words for `D` and `S`(definition provided in `Similarity, based on words` part of the answer). Here `D` is data string and `S` is search string — Alma Do, Jan 28 '14 at 09:35
I'm trying out the class right now :). Does it work with Danish characters such as "æøå"? — Jazerix, Jan 28 '14 at 13:37
Actually, that depends from your callback. This class above has intention to only one thing - extended similarity check, based on words. Thus, to maintain various symbols, you may just replace them into others within your callback. Also you may want to replace distance check with some other thing (or, for example, place that check into your callback) — Alma Do, Jan 28 '14 at 14:14
Say search is my black list, which contains a range of names that are black listed. When I change the data string a foreach warning is return http://3v4l.org/gJmcH Marvelous code though :) — Jazerix, Jan 29 '14 at 11:06
I know this is very nooby, but do you think you could add a function that does that. My function is acting up, and returning mismatch if the name is not exactly as in the array. And I don't fully understand n < m :/ This is my function: http://pastebin.com/h3W4za5T — Jazerix, Jan 29 '14 at 16:21
@Jazerix it's simple. If count of words in search string is higher than in data string - then you should decide - is it mismatch or you need to switch search and data. For second option I've added `normalizeInput()` method [here](http://3v4l.org/ndM2U) — Alma Do, Jan 30 '14 at 05:41
I don't code php, I would not have seen this if I weren't reading the [thread on meta](http://meta.stackoverflow.com/questions/252756/are-high-reputation-users-answering-fewer-questions?cb=1). However, this may be one of the top 10 answers I have ever read on this site. +1, I wish it could be more. — durron597, May 01 '14 at 04:24

score 1 · Answer 2 · edited May 23 '17 at 12:17

(had a bit of a think over lunch)

I think what you're essentially trying to do is not necessarily to find out whether two names sound similar, but whether they have similar letters in a similar order. So I think the best bet might be to "throw away" common characters and just look at the rest. This should be possible with a regular expression, and if the names are stored in a MySQL database, you'll probably want to use REGEXP...

Something like this may serve your purposes assuming you've got an HTML form with a single 'name' field:

1: capture the name and remove common characters (basically vowels, and potentially also the Danish accented vowels; for simplicity in the SQL I'm just going to use 'aeiou'), but keep the whitespace for now:

// using 'Niels Faurskov Andersen' as the example...
// lower-case first so that capital vowels (the "A" in "Andersen") are removed too
$sName = preg_replace( '/[aeiou]/', '', strtolower( $_POST['name'] ) );

// you should now have 'nls frskv ndrsn'

2: assuming the forename is always first, you can build an SQL REGEXP pattern matching the remainder of the forename plus either of the following names:

// taking $sName from (1) 'nls frskv ndrsn'

// explode $sName on whitespace
$aName = explode(' ', $sName);

// if the exploded $sName has more than 1 element assume forename + surname(s)
if(count($aName) > 1) {

  // extract the forename (array_shift() removes it from $aName and returns it)
  $sForename = array_shift($aName);

  // whatever remains in $aName is the surname(s)
  $aSurnames = $aName;

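  // build up the name-matching part of the SQL query
  // ([[:space:]] is used instead of \s, which older MySQL regex engines do not understand)
  $sNameSQLPattern = $sForename . '[[:space:]]+(' . implode('[[:space:]]*|', $aSurnames) . '[[:space:]]*)';

  // you should now have a REGEXP insert for MySQL like 'nls[[:space:]]+(frskv[[:space:]]*|ndrsn[[:space:]]*)'
  // this will match 'nls' followed by either 'frskv' or 'ndrsn' (or both)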
}

// if there are no whitespace characters in the exploded string...
else {
  // ... just use the name as is (with common characters replaced)
  // appearing anywhere in the 'full name'
  $sNameSQLPattern = ".*{$sName}.*";
}

3: query the database

// build the SQL SELECT statement 
// remembering to do the same 'common character' replacement
// unfortunately there's no way to do a RegExp replacement in MySQL...
$sFindNameQuery = "SELECT `blacklist`.`fullname` "
    . "FROM `blacklist` "
    . "WHERE "
    . "REPLACE( "
    . "REPLACE( "
    . "REPLACE( "
    . "REPLACE( "
    . "REPLACE( LOWER(`blacklist`.`fullname`), 'a', '' ), "
    . "'e', ''), "
    . "'i', ''), "
    . "'o', ''), "
    . "'u', '')  "
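    // the pattern must be a quoted string literal; escape it (or bind it as a parameter) since it derives from user input
    . "REGEXP '{$sNameSQLPattern}' ";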

That's ugly as sin, but it should essentially give you a regular expression pattern match on a sort of basic "fingerprint" of the user's name. It should be fairly forgiving, so if there are no matches you can (reasonably) safely assume the person hasn't been blacklisted, but if there are one or more matches they can be pulled up for manual review.
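Because the pattern is derived from user input, it may be safer to bind it as a parameter than to paste it into the SQL string. Here is a sketch of the same query built with a placeholder; the function name is hypothetical, the table and column names are the ones used above, and a PDO connection in `$pdo` is assumed for the usage lines in the comments.

```php
<?php
// Sketch: build the vowel-stripping query with a "?" placeholder for the pattern.
function buildBlacklistQuery(): string
{
    $expr = "LOWER(`blacklist`.`fullname`)";
    foreach (['a', 'e', 'i', 'o', 'u'] as $vowel) {
        $expr = "REPLACE({$expr}, '{$vowel}', '')"; // wrap one REPLACE per vowel
    }

    return "SELECT `blacklist`.`fullname` FROM `blacklist` WHERE {$expr} REGEXP ?";
}

echo buildBlacklistQuery(), PHP_EOL;

// Usage (not executed here; requires a database and an assumed $pdo connection):
// $stmt = $pdo->prepare(buildBlacklistQuery());
// $stmt->execute([$sNameSQLPattern]);
// $matches = $stmt->fetchAll(PDO::FETCH_COLUMN);
```

Generating the nested REPLACE calls in a loop also makes it easier to keep the list of stripped characters in one place.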

When it comes to removing accented characters, you could use iconv in the PHP to transliterate those characters to ASCII - which is fine for building a fingerprint: http://www.php.net/iconv

Unfortunately you'd then need to match that up in the SQL - and to do that you'd be better off putting the whole character replacement (that 'REPLACE' block) into a function as you're going to need to map a lot of replacements: How to remove accents in MySQL?

Remember though, whatever replacements you make in the PHP side you also have to make in the database query - so it would probably be better to create both a PHP function and a MySQL function that essentially mirror each other's functionality.
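The PHP half of such a mirrored pair might look like the sketch below. The function name nameFingerprint is hypothetical, and treating æ, ø and å as characters to strip is an assumption for illustration; whatever set you choose, the MySQL side has to strip exactly the same characters.

```php
<?php
// Hypothetical helper: lower-case, drop vowels (including Danish æ, ø, å),
// and collapse whitespace. Assumes UTF-8 input and the mbstring extension.
function nameFingerprint(string $name): string
{
    $name = mb_strtolower($name, 'UTF-8');
    $name = preg_replace('/[aeiouæøå]/u', '', $name); // /u: treat the pattern and subject as UTF-8
    return trim(preg_replace('/\s+/', ' ', $name));
}

echo nameFingerprint('Niels Faurskov Andersen'), PHP_EOL; // nls frskv ndrsn
```

Using one function for both the submitted guest name and any test data keeps the fingerprint rules from drifting apart.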

Hope this is of some help... it's a bit rambling :\

That sounds perfect, I'll give it a try later when I get home, and get back,thank you :D — Jazerix, Jan 27 '14 at 13:41
