T-SQL - Average % of Matching Characters

Question

I have data coming from two different systems which I input into two different tables in SQL Server 2016. These tables are related with a Key.

Table1

  Key       Name
  ------------------------
    1       Jim Parsons
    2       Steven Woz
    3       Billy Jean
    4       Carol Henderson
    5       Sara Hauder

Table2

   Key      Name
   ---------------------------
    1       Jimmy Parsons 
    2       Steven Wozniak
    3       Bill Jean
    4       Andy Smith
    5       Sara Hauder (a)

What I am hoping to develop is an additional column that will tell me if the Names match. However, if the names are similar, I want to consider it a match.

In order to accomplish this I was thinking I could do something like

select 
    *,  
    case when a.Name like b.Name 1 else 0 
    end as Flag 
from 
    (Table1 a 
join 
    Table2 b on a.Key = b.Key)

But obviously this wouldn't work because even the slightest difference can cause a like statement to be false.

I'm not very familiar with how to find the longest common substring, but I don't actually care what the string is. What I'm thinking is I need to determine the average number/percentage of characters shared by the two fields and then say: if the number of characters shared between a.Name and b.Name is above or equal to the average, then 1 else 0. Not really sure how to go about this however...

AS an example: Sara Hauder should match with Sara Hauder (a), but Carol Henderson should not match with Andy Smith. Note that many Names will sometimes have some different tags at the end in parenthesis which can be filtered out if necessary. However, I am perfectly capable of adding this feature myself if you choose to ignore.

I would 100% implement a [levenshtein distance function in tsql](https://stackoverflow.com/questions/560709/levenshtein-distance-in-t-sql) and use that to give me the Levenshtein Distance between the two strings. Then base your flag on the result of that function tweaking what you consider to be a "match" as appropriate. Levenshtein distance is the tried and true way of doing "fuzzy" string comparison like this (there are others). I would definitely not try to come up with my own algorithm for doing this unless your usecase is VERY specific. — JNevill, Feb 14 '18 at 17:37

score 0 · Answer 1 · answered Feb 14 '18 at 17:40

Here is a real quick and dirty way using built in functions in MSSQL. This will not be perfect by any means, but it will get you pretty close.

DECLARE @table1 TABLE (id int IDENTITY(1,1), name VARCHAR(100))
DECLARE @table2 TABLE (id int IDENTITY(1,1), name VARCHAR(100))

INSERT INTO @table1 (name)
VALUES 
('Jim Parsons'),
('Steven Woz'),
('Billy Jean'),
('Carol Henderson'),
('Sara Hauder')

INSERT INTO @table2 (name)
VALUES 
('Jimmy Parsons'),
('Steven Wozniak'),
('Bill Jean'),
('Andy Smith'),
('Sara Hauder (a)')


SELECT *, 
CASE WHEN SOUNDEX(t1.name) = SOUNDEX(t2.name) THEN 1 ELSE 0 END AS name_match

FROM @table1 t1
INNER JOIN @table2 t2
    ON t1.id = t2.id

My experience is soundex is not very good for matching. Still good answer. — paparazzo, Feb 14 '18 at 18:06

score 0 · Accepted Answer · answered Feb 14 '18 at 18:11

I would do this in .NET with Levenshtein_distance. It is a great measure of text similarity. You can just run it once and as records are added and it does not take very long.

I see you have a comment of the same. I did not read it before posting this answer.

T-SQL - Average % of Matching Characters

2 Answers2