Finding Similar People's Names from Database

Question

I have a table in MySQL with names in it. Given an input name, I am trying to find all similar names in the table. I've heard a lot about Levenshtein/Damerau–Levenshtein distance, but it doesn't seem like it would work well for this; I explain my reasoning below.

To elaborate:

User inputs a name that could have, say, five words in it. For the sake of this example, say the inputted name is "Juan Manuel Beldad."
I attempt to find similar names in the Database. Say the database includes
1. "Juan Beldad" (missing middle name)
2. "Juan Belded" (Belded not Beldad)
3. "Juan Manuel Sebastian Beldad" (extra middle name)
I return them ordered by how close each one is to the input. In this case, that would be: "Juan Beldad", "Juan Belded", "Juan Manuel Sebastian Beldad".

My reason for doubting Levenshtein/Damerau–Levenshtein distance here is that it wouldn't handle extra or missing names well. My understanding is that Levenshtein distance is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. So the following two candidates would be considered the same distance from the original string.

Original string: "Juan Beldad"
Want to find: "Juan Manuel Beldad"
(7 character insertion)
Would also find: "Mike Bell"
(5 character substitution (M-i-k-e-l), 2 character deletion(a-d))

Since both have a distance of 7 edits, "Mike Bell" would be considered just as close to "Juan Beldad" as "Juan Manuel Beldad" is.
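The arithmetic above can be checked with a minimal Levenshtein sketch. This is plain Python for illustration only (the function name `levenshtein` is my own, not a MySQL built-in), and it compares the strings case-sensitively, exactly as written in the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


print(levenshtein("Juan Beldad", "Juan Manuel Beldad"))  # 7
print(levenshtein("Juan Beldad", "Mike Bell"))           # 7
```

Both calls give 7, so a raw character-level distance on the full string cannot tell the missing middle name apart from an unrelated name.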

I was thinking about removing the middle name(s) on both the input and the table side, and then computing Levenshtein/Damerau–Levenshtein distance. Am I overthinking this, and is there a better way to do it?

With something like proper nouns, it's a very subjective decision. You have to decide what similar means based on the results that you want. — Aluan Haddad, Aug 16 '20 at 02:36
Provide sample data and expected output. If possible explain the logic with bullet points. Easier for others to read and react fast. — Sujitmohanty30, Aug 16 '20 at 04:41
Surely 'Juan Beldad' is a closer match than 'Juan Belded' !?!??? — Strawberry, Aug 16 '20 at 07:59

K4M · Accepted Answer · 2020-08-16T06:20:19.200

There are many possible problems you need to consider when matching names. Some of those are:

nicknames (Bob - Robert)
typos
name swap (last name switched with first name)
maiden name
initials
truncated names
phonetically similar name (Jennifer - Jenny)

Damerau–Levenshtein distance is one of the edit distance algorithms you can use. Each algorithm accounts for different operations (character insert, replace, delete, swap, etc.). None of them is perfect, but each provides a distance between two strings.

You need to decide how much error is acceptable to you (i.e. the cutoff for positive matches). The example you gave requires a minimum of 7 operations. With that many operations allowed, many names will come back at the same distance.

When comparing names, you should try to make both sides comparable by normalizing them: if one side has only the first letter of the first name, for example, you should reduce the other side to its first letter too, so that the edit distance algorithm gives you a better result.

Similarly, you can get rid of the middle name if the other side does not have the middle name (and you're okay to ignore cases where a middle name is entered as first name). But a better alternative is to generate all possible first-last name pairs using all words available in a name and see if any of the pairs will produce a better edit distance. You can also compare each word on its own and find the best word combination with the best score (the trade-off is ignoring the typos at word boundaries).
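A rough sketch of the "try every first-last pair" idea, under a few assumptions of mine: names are split on whitespace, `difflib.SequenceMatcher.ratio()` from the standard library stands in for a normalized edit-distance score, and the helper names `word_similarity` and `best_pair_score` are made up for this example.

```python
from difflib import SequenceMatcher
from itertools import permutations


def word_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two single words (stand-in for edit distance)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_pair_score(query: str, candidate: str) -> float:
    """Best average similarity over all ordered two-word pairs from each name.

    Every word can play the role of first or last name, so middle names are
    skipped automatically and swapped first/last names still line up.
    Returns 0.0 if either name has fewer than two words.
    """
    best = 0.0
    for q_first, q_last in permutations(query.split(), 2):
        for c_first, c_last in permutations(candidate.split(), 2):
            score = (word_similarity(q_first, c_first)
                     + word_similarity(q_last, c_last)) / 2
            best = max(best, score)
    return best


for name in ["Juan Beldad", "Juan Belded", "Beldad Juan", "Mike Bell"]:
    print(name, round(best_pair_score("Juan Manuel Beldad", name), 3))
```

Because words are compared one at a time, a typo that merges or splits words ("JuanBeldad") is not handled, which is the trade-off mentioned above. The number of pairs grows quickly with the word count, so this is only reasonable for short names.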

You should also consider using a phonetic similarity algorithm like Double Metaphone in addition to Damerau–Levenshtein and generating a combined score. Phonetic algorithms are designed for a specific language family and try to determine whether two names would sound similar in that language family. The result is not reliable on its own (at least that was my experience), but combining it with an edit distance algorithm will improve your matching.
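To give a feel for a combined score, here is a small sketch. Note the substitution: it uses American Soundex, a much simpler phonetic code, instead of Double Metaphone, because Soundex fits in a few lines without a third-party library. The 0.3 weight, the `combined_score` helper, and the use of `difflib` as the edit-similarity part are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Soundex digit for each coded consonant; vowels, y, h and w have no digit.
CODES = {}
for group, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                     ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in group:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                  # h and w do not separate equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code                   # a vowel resets prev, so repeats count again
    return (encoded + "000")[:4]


def combined_score(a: str, b: str, phonetic_weight: float = 0.3) -> float:
    """Weighted mix of string similarity and a phonetic-code match, in [0, 1]."""
    edit_part = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    phonetic_part = 1.0 if soundex(a) == soundex(b) else 0.0
    return (1 - phonetic_weight) * edit_part + phonetic_weight * phonetic_part


print(soundex("Beldad"), soundex("Belded"), soundex("Bell"))
print(combined_score("Beldad", "Belded"), combined_score("Beldad", "Bell"))
```

Soundex was designed around English pronunciation, so for Spanish names like the ones in the question a code tuned to that language would probably do better.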

To reduce the error rate, additional data elements should be considered like ZIP, DOB etc.

In the end, it is all about trade-offs: your intended use case, your acceptable threshold for positive matches, the quality of your data, time/cost limits, etc. For example, you could simply require the first letter of the first name and the first letter of the last name to be the same, in addition to a Damerau–Levenshtein distance cutoff. That will reduce the pool of false positives, at the cost of missing typos in those first letters.
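As an illustration of that last example, the sketch below filters candidates by matching initials and then by distance. It uses the optimal string alignment variant of Damerau–Levenshtein (adjacent swaps allowed, but no substring edited twice), and the cutoff of 2 and the function names are arbitrary choices of mine.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: Levenshtein plus adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbours
    return d[-1][-1]


def same_initials(a: str, b: str) -> bool:
    """True if the first and last words start with the same letters."""
    wa, wb = a.lower().split(), b.lower().split()
    if not wa or not wb:
        return False
    return wa[0][0] == wb[0][0] and wa[-1][0] == wb[-1][0]


def candidates_within(query: str, names: list[str], max_distance: int = 2) -> list[str]:
    """Keep names that share initials with the query and are within the cutoff."""
    return [n for n in names
            if same_initials(query, n)
            and osa_distance(query.lower(), n.lower()) <= max_distance]


names = ["Juan Belded", "Jaun Beldad", "Mike Bell", "Joan Baldwin"]
print(candidates_within("Juan Beldad", names))  # ['Juan Belded', 'Jaun Beldad']
```

The initials check is cheap, so it can run first and spare the distance computation for most of the table.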

Like in many things nowadays, I think the best result in this area could be achieved through a well-trained machine-learning model. I haven't worked in this area for a while so I'm not sure what's out there but you could probably find a good cloud based solution for the best quality matches, for a fee of course, if that's important to you.

You can see an overview of name matching techniques here as further reading.

Thanks for such a thorough answer. Based on what you have said and the article you linked, I think that I'm going to go for removing middle names on both sides, then doing a firstname-to-firstname and lastname-to-lastname distance. — Bubinga, Aug 16 '20 at 06:36

score 0 · Answer 2 · answered Sep 04 '20 at 02:45

I ended up doing Jaro-Winkler Distance with some middle name managing code. I stole my Jaro-Winkler Distance from user leebickmtu here btw. So essentially what I do is:

Remove Middle Name(s) from Input Name and count them
Get all Names from Database that you want to compare to
Remove all Middle Name(s) from Database Names and count them
Run Jaro-Winkler on Input Name w/o middle name(s) to Database names w/o middle name(s). Stop here for names below a threshold
For each Middle Name add some value onto the Jaro-Winkler Score for that Name. I kinda randomly chose 1/35th and it seems to work well enough for my purposes.
Sort by Score
Return to the database with the (now shorter) sorted list of names and get any extra information you want.
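The steps above could be sketched roughly as follows. This is not the code from the linked answer; it is a from-scratch Jaro-Winkler written for illustration. Several details are my assumptions, since the steps do not pin them down: the first and last words are kept as the name, the score is treated as a distance (1 minus similarity, lower is better), the 1/35 value is applied as a penalty per unit of difference between the two middle-name counts, and the 0.85 threshold is arbitrary.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters of both strings in order and count mismatches.
    k = out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, scaling: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * scaling * (1 - j)


def split_name(name: str) -> tuple[str, int]:
    """Return ("first last", number of middle names removed)."""
    parts = name.lower().split()
    if len(parts) <= 2:
        return " ".join(parts), 0
    return f"{parts[0]} {parts[-1]}", len(parts) - 2


def rank(query: str, candidates: list[str],
         threshold: float = 0.85, penalty: float = 1 / 35) -> list[str]:
    """Rank candidates by Jaro-Winkler distance plus a middle-name penalty."""
    q_core, q_mid = split_name(query)
    scored = []
    for cand in candidates:
        c_core, c_mid = split_name(cand)
        similarity = jaro_winkler(q_core, c_core)
        if similarity < threshold:
            continue                      # drop weak matches before penalising
        distance = (1 - similarity) + abs(q_mid - c_mid) * penalty
        scored.append((distance, cand))
    scored.sort(key=lambda pair: pair[0])
    return [cand for _, cand in scored]


print(rank("Juan Manuel Beldad", ["Mike Bell", "Juan Belded", "Juan Beldad"]))
```

With the count-difference reading, two candidates whose first and last names match exactly and whose middle-name counts differ from the input by the same amount will tie, so a secondary sort key may be needed.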
