I have a table in MySQL with names in it. Given an input name, I am trying to find all similar names in the table. I've heard a lot about Levenshtein/Damerau–Levenshtein distance, but it doesn't seem like it would work well for this. I'll explain my reasoning below.
To elaborate:
- The user inputs a name that could have, say, up to five words in it. For the sake of this example, say the input name is "Juan Manuel Beldad".
- I attempt to find similar names in the database. Say the database includes:
  - "Juan Beldad" (missing middle name)
  - "Juan Belded" ("Belded", not "Beldad")
  - "Juan Manuel Sebastian Beldad" (extra middle name)
- I return them ordered by how close each one is to the input. In this case, that would be: "Juan Beldad", "Juan Belded", "Juan Manuel Sebastian Beldad".
My reason for questioning the use of Levenshtein/Damerau–Levenshtein distance in this case is that it wouldn't handle extra or missing names well. My understanding of Levenshtein distance is that it is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one string into the other. So the following two candidates would be considered the same distance from the original string:
Original string: "Juan Beldad"
Want to find: "Juan Manuel Beldad"
(7 character insertions: "Manuel" plus a space)
Would also find: "Mike Bell"
(5 character substitutions (M-i-k-e-l), 2 character deletions (a-d))
Since both have a distance of 7 edits, "Mike Bell" would be considered just as close to "Juan Beldad" as "Juan Manuel Beldad" is.
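To double-check my arithmetic, here is a minimal sketch of plain Levenshtein distance in Python. This assumes the comparison happens in application code rather than inside MySQL, and the function name `levenshtein` is just my own for illustration:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions and substitutions only."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


original = "Juan Beldad"
for candidate in ("Juan Manuel Beldad", "Mike Bell"):
    print(candidate, levenshtein(original, candidate))
```

Both candidates come out at 7, which is exactly the problem: the score alone can't tell an added middle name apart from an almost entirely different name.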
I was thinking about stripping the middle name(s) from both the input and the names in the table, and then computing Levenshtein/Damerau–Levenshtein distance on what remains. Am I overthinking this, and is there a better way to do it?
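To make that idea concrete, here is a rough sketch in Python. It assumes the names are loaded into application code, and that the first and last whitespace-separated tokens are the first name and surname, which is a simplification. `strip_middle` and `rank` are hypothetical helper names, not anything that exists in MySQL:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions and substitutions only."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def strip_middle(name: str) -> str:
    """Keep only the first and last tokens (assumed first name and surname)."""
    parts = name.split()
    if len(parts) < 2:
        return " ".join(parts)
    return f"{parts[0]} {parts[-1]}"


def rank(query: str, names: list[str]) -> list[tuple[str, int]]:
    """Sort names by edit distance between the middle-stripped forms."""
    q = strip_middle(query)
    scored = [(n, levenshtein(q, strip_middle(n))) for n in names]
    return sorted(scored, key=lambda pair: pair[1])


names = ["Juan Belded", "Juan Manuel Sebastian Beldad", "Mike Bell", "Juan Beldad"]
for name, dist in rank("Juan Manuel Beldad", names):
    print(dist, name)
```

With this, "Mike Bell" drops to the bottom (distance 7), but "Juan Beldad" and "Juan Manuel Sebastian Beldad" both score 0, so I would still need some tie-breaker to get the ordering I described above.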