I wrote a method that helps to match names that represent the same person but are written in different ways (full name or short version), for example:

Paul Samuelson-Smith and Paul Smith would be considered equal based on my method:

private static boolean equalName(String name_2, String name_1) {
    // Pass 1: does name_2 contain every part of name_1?
    boolean equality1 = true;
    name_1 = name_1.replace("&", " ").replace("-", " ");
    String[] names1 = name_1.split(" ");
    for (String part : names1) {
        if (!name_2.contains(part)) {
            equality1 = false;
            break;
        }
    }
    // Pass 2: does name_1 contain every part of name_2?
    boolean equality2 = true;
    name_2 = name_2.replace("&", " ").replace("-", " ");
    String[] names2 = name_2.split(" ");
    for (String part : names2) {
        if (!name_1.contains(part)) {
            equality2 = false;
            break;
        }
    }
    // The names match if either one is fully covered by the other.
    return equality1 || equality2;
}

However, my method still fails when a name contains a typo. For example, Paul Samuelson-Smith and Paull Smith are the same person, but they are not matched. Is there any API that would help account for possible typos? How can I improve my method?

Aleksei Nikolaevich
    You might want to check the [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance). However, in practice I've found that the only thing that works reliably is keeping an "alias" table and checking there. – Benjamin Gruenbaum Oct 18 '13 at 17:57

2 Answers

Possible duplicate

Here is a library that has a few distance algorithms built in: http://sourceforge.net/projects/simmetrics/
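To give an idea of what such a distance algorithm does, here is a minimal sketch in plain Java that does not use simmetrics at all. The class name `FuzzyNames`, the method `fuzzyEqualName` and the `maxTypos` parameter are made up for illustration. It assumes two names match when every part of the shorter name is within `maxTypos` single-character edits (Levenshtein distance) of some part of the longer name.

```java
import java.util.Locale;

public class FuzzyNames {

    // Classic dynamic-programming Levenshtein distance:
    // insertion, deletion and substitution each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Lower-case the name and split it on whitespace, '&' and '-'.
    static String[] tokens(String name) {
        return name.toLowerCase(Locale.ROOT)
                   .replace("&", " ")
                   .replace("-", " ")
                   .trim()
                   .split("\\s+");
    }

    // Hypothetical helper: true if every part of the shorter name is within
    // maxTypos edits of some part of the longer name.
    static boolean fuzzyEqualName(String name1, String name2, int maxTypos) {
        String[] a = tokens(name1);
        String[] b = tokens(name2);
        String[] shorter = a.length <= b.length ? a : b;
        String[] longer = a.length <= b.length ? b : a;
        for (String s : shorter) {
            boolean found = false;
            for (String l : longer) {
                if (levenshtein(s, l) <= maxTypos) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }
}
```

With `maxTypos = 1`, `fuzzyEqualName("Paul Samuelson-Smith", "Paull Smith", 1)` matches because "paull" is one edit away from "paul". Keep the threshold small: with very short name parts, even one or two allowed edits can make unrelated names match.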

Community
Amir T
    For how to use simmetrics, see this answer: http://stackoverflow.com/a/7685846/583513 – Philipp May 19 '14 at 12:23

The algorithm you need should return more than just true/false. For example, when you compare 'Paula Smith', 'Paul Smith' and 'Paul Saumelson-Smith', you should choose the best match. Have a look here: http://www.katkovonline.com/2006/11/java-fuzzy-string-matching/ but note that it is better suited to classification, that is, when you need to work on a large database and choose the best matches.
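As a rough sketch of the "choose the best match" idea (this is not the approach from the linked article), the code below scores each candidate with a similarity between 0 and 1 and returns the highest-scoring one. The class `BestNameMatch`, the methods `similarity` and `bestMatch`, and the choice of normalized Levenshtein distance as the score are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;

public class BestNameMatch {

    // Levenshtein distance: insertion, deletion and substitution each cost 1.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Similarity in [0, 1]: 1.0 for identical strings, lower as more edits are needed.
    static double similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / maxLen;
    }

    // Returns the candidate most similar to the query (case-insensitive),
    // or null if there are no candidates. On a tie the first candidate wins.
    static String bestMatch(String query, List<String> candidates) {
        String best = null;
        double bestScore = -1.0;
        String q = query.toLowerCase(Locale.ROOT);
        for (String candidate : candidates) {
            double score = similarity(q, candidate.toLowerCase(Locale.ROOT));
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

In practice you would probably also require a minimum score before accepting the best candidate, so that a query with no reasonable match returns nothing instead of the least-bad name.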

kan