0

I need to do a fuzzy name comparison for two names and see if they match. Think of it like you've written a check for "Frank A. Anderson" and when the check is deposited it changed to "Frank Smith". The goal is to see how close the two names match.

I discovered the function Difference() in SQL Server and it seems like that does the job by comparing two strings and return how similar they are. However, I am having a problem understanding why the below query returns the same result:

SELECT 
    DIFFERENCE ('Frank A. Anderson', 'Frank Anderson') AS Good,     
    DIFFERENCE ('Frank A. Anderson', 'Frank Smith') AS Bad

Both Good and Bad returns "4", which I understand is a match. For the first one, sure, missing a middle name "A" is OK. But for the second one, not only it's missing the middle name A, it has a completely different last name. How come it also returns a 4? How can these two comparison returns the same result?

If Difference() is not the right method I need for my purpose, what other suggestions you have? I hate to do a straight string equal as a comma/dot/space would ruin the comparison.

Any help would be appreciated.

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Franky
  • 1,262
  • 2
  • 16
  • 31
  • [levenshtein-distance-in-t-sql](https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql) – Stu Jan 08 '23 at 16:09
  • Please provide some sample data and the expected result. Talking about just one or two values is very vague. – Jonas Metzler Jan 08 '23 at 16:11
  • 1
    The Difference() functions doesn't calculate the Levenstein distance (which you probably need). For more info on Distance() see https://www.sqlservertutorial.net/sql-server-string-functions/sql-server-difference-function/, for info on Levenstein distance see @Stu comment. – emilyisstewpid Jan 08 '23 at 16:11
  • Aha, Levenshtein distance, I think that's probably what i need. thanks guys, will look into it! – Franky Jan 08 '23 at 16:20
  • `DIFFERENCE` is based on the `SOUNDEX` which only considers the first "word" in the string not all of them. As such, in both cases you are getting the `DIFFERENCE` between `'Franck'` and `Franck'`. – Thom A Jan 08 '23 at 17:02
  • larnu, yep, i realized that now. so difference() is not the function i am looking for. i am researching Levenshtein distance. – Franky Jan 08 '23 at 17:57

0 Answers0