Fuzzy lookup in SQL to match names

Question

I am stuck at a problem where I need to populate historical data using Fuzzy match. I'm using SQL Server 2014 Developer Edition

MainTbl.UNDERWRITER_CODE is where data needs to be populated in place of NULL. This data needs to be from LKP table. The Matching criteria is MainTbl.UNDERWRITER_NAME with LKP.UNDERWRTIER_NAME

sample:

CREATE TABLE MainTbl(UNDERWRITER_CODE int,  UNDERWRITER_NAME varchar(100))
INSERT INTO MainTbl VALUES
(NULL,'dylan.campbell'),
(NULL,'dylanadmin'),
(NULL,'dylanc'),
(002,'Dylan Campbell'),
(002,'dylan.campbell'),
(002,'dylanadmin'),
(NULL,'scott.noffsinger'),
(001,'Scott Noffsinger')


CREATE TABLE LKP(UNDERWRITER_CODE int,  UNDERWRITER_NAME varchar(100))
INSERT INTO LKP VALUES
(002,'Dylan Campbell'),
(001,'Scott Noffsinger')

expected output:

2 dylan.campbell
2 dylanadmin
2 dylanc
2 Dylan Campbell
2 dylan.campbell
2 dylanadmin
1 scott.noffsinger
1 Scott Noffsinger

@MattEvans : How am I suppose to implement it? Can you briefly shareit? I have around hundred thousand rows which needs to be affected — Prabhat G, Feb 10 '20 at 11:06
This isn't really SQL Server's forté, if I'm honest. You could use something like [`SOUNDEX`](https://learn.microsoft.com/en-us/sql/t-sql/functions/soundex-transact-sql?view=sql-server-ver15) or [`DIFFERENCE`](https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-ver15). Perhaps relying on that `DIFFERENCE` [returns a value of 3 or 4](https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=8bdfd27152219565435cf6d36977279c)? We don't really have a lot of samples to base on. — Thom A, Feb 10 '20 at 11:08
What if there's a 'Dylan Smith' in your lookup? What does dylanadmin get assigned to? — Nick.Mc, Feb 10 '20 at 11:28
The first thing I would check: Use a `GROUP BY` to find out all variations of UNDERWRITER_NAME where the code is NULL. I would push this list into a mapping table and find one common expression in a semi-automatic way (mixture of automation and manual corrections). Finally I'd use this cleaned mapping for the JOIN. — Shnugo, Feb 10 '20 at 11:54

score 2 · Answer 1 · answered Feb 10 '20 at 11:31

2

SQL is not really designed for such fuzzy string comparisons. However, SQL Server has a function called difference(), which works for your data:

select mt.*, l.*
from maintbl mt outer apply
     (select top (1) lkp.*
      from lkp
      order by difference(mt.underwriter_name, lkp.underwriter_name) desc
     ) l;

Here is a db<>fiddle.

answered Feb 10 '20 at 11:31

Gordon Linoff

1,242,037
58
646
786

Thank you. But it breaks for many of the records (not mentioned in example above). Guess that is because of output of difference. – Prabhat G Feb 10 '20 at 12:26
@PrabhatG . . . Read the first sentence of the answer. – Gordon Linoff Feb 10 '20 at 13:06

Jay Shankar Gupta · Answer 2 · 2020-02-10T11:43:50.260

UPDATE T1 SET T1.UNDERWRITER_CODE = T2.UNDERWRITER_CODE
FROM MainTbl T1 
INNER JOIN LKP T2
ON T1.UNDERWRITER_NAME LIKE CONCAT('%', LEFT( LOWER(T2.UNDERWRITER_NAME)
                                             ,CHARINDEX(' '
                                                        ,LOWER(T2.UNDERWRITER_NAME)
                                                       ) - 1
                                            )
                                      , '%'
                                   )

Output

https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=23a3a55cc1ab1741f6e70dd210db0471

Explanation

Step 1:

SELECT *
       ,CONCAT('%', LEFT( LOWER(T2.UNDERWRITER_NAME)
                                             ,CHARINDEX(' '
                                                        ,LOWER(T2.UNDERWRITER_NAME)
                                                       ) - 1
                                            )
                                      , '%'
                                   ) AS JOIN_COL
FROM LKP T2

Output of above Query

UNDERWRITER_CODE    UNDERWRITER_NAME    JOIN_COL
2                   Dylan Campbell      %dylan%
1                   Scott Noffsinger    %scott%

Used the above JOIN_COL data format in join condion with like operator Step 2:

SELECT T2.UNDERWRITER_CODE,T1.UNDERWRITER_NAME
FROM MainTbl T1
INNER JOIN LKP T2
ON T1.UNDERWRITER_NAME LIKE CONCAT('%', LEFT( LOWER(T2.UNDERWRITER_NAME)
                                             ,CHARINDEX(' '
                                                        ,LOWER(T2.UNDERWRITER_NAME)
                                                       ) - 1
                                            )
                                      , '%'
                                   )

Output of above query:

UNDERWRITER_CODE    UNDERWRITER_NAME
2                   dylan.campbell
2                   dylanadmin
2                   dylanc
2                   Dylan Campbell
2                   dylan.campbell
2                   dylanadmin
1                   scott.noffsinger
1                   Scott Noffsinger

Can you kindly explain your joining logic? – Prabhat G Feb 10 '20 at 11:21 — Prabhat G, Feb 10 '20 at 11:21
@PrabhatG can you goto above link for explanation – Jay Shankar Gupta Feb 10 '20 at 11:26 — Jay Shankar Gupta, Feb 10 '20 at 11:26
How does that **explain** the logic, @JayShankarGupta ? – Thom A Feb 10 '20 at 11:29 — Thom A, Feb 10 '20 at 11:29
@PrabhatG added explanation – Jay Shankar Gupta Feb 10 '20 at 11:44 — Jay Shankar Gupta, Feb 10 '20 at 11:44

Alan Burstein · Answer 3 · 2020-02-11T05:09:09.343

First, fuzzy lookup is a little vague. There are a number of algorithms that are used for fuzzy matching including the Levenshtein Distance, Longest Common Subsequence, and some others referenced in the "See Also" section of this Wikipedia page about Approximate String Matching.

To rephrase what you are attempting to do. You are updating the UNDERWRITER_CODE column in MainTbl with the UNDERWRITER_CODE that matches the most similar UNDERWRITER_NAME in LKP. Fuzzy algorithms can be used for measuring similarity. Note my post here. For the sample data you provided we can use Phil Factor's T-SQL Levenshtein functions and match based on the lowest Levenshtein value like so:

SELECT TOP (1) WITH TIES
  UNDERWRITER_CODE_NULL = m.UNDERWRITER_CODE,
  LKP_UN = m.UNDERWRITER_NAME, l.UNDERWRITER_NAME, l.UNDERWRITER_CODE,
  MinLev = dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME)
FROM       dbo.MainTbl AS m
CROSS JOIN dbo.LKP     AS l
WHERE      m.UNDERWRITER_CODE IS NULL
ORDER BY ROW_NUMBER() OVER (PARTITION BY m.UNDERWRITER_NAME
                      ORDER BY dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME))

Returns:

UNDERWRITER_CODE_NULL LKP_UN             UNDERWRITER_NAME   UNDERWRITER_CODE MinLev
--------------------- ------------------ ------------------ ---------------- -----------
NULL                  dylan.campbell     Dylan Campbell     2                1
NULL                  dylanadmin         Dylan Campbell     2                8
NULL                  dylanc             Dylan Campbell     2                8
NULL                  scott.noffsinger   Scott Noffsinger   1                1

We can use this logic to update UNDERWRITE_CODE like so:

WITH FuzzyCompare AS
(
    SELECT TOP (1) WITH TIES
      UNDERWRITER_CODE_NULL = m.UNDERWRITER_CODE,
      LKP_UN = m.UNDERWRITER_NAME, l.UNDERWRITER_NAME, l.UNDERWRITER_CODE,
      MinLev = dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME)
    FROM       dbo.MainTbl AS m
    CROSS JOIN dbo.LKP     AS l
    WHERE      m.UNDERWRITER_CODE IS NULL
    ORDER BY ROW_NUMBER() OVER (PARTITION BY m.UNDERWRITER_NAME
                          ORDER BY dbo.LEVENSHTEIN(m.UNDERWRITER_NAME, l.UNDERWRITER_NAME))
)
UPDATE fc
SET    fc.UNDERWRITER_CODE_NULL = fc.UNDERWRITER_CODE
FROM   FuzzyCompare AS fc
JOIN   dbo.MainTbl  AS m ON fc.UNDERWRITER_NAME = m.UNDERWRITER_NAME;

After this update SELECT * FROM dbo.mainTbl Returns:

UNDERWRITER_CODE UNDERWRITER_NAME
---------------- -------------------
2                dylan.campbell
2                dylanadmin
2                dylanc
2                Dylan Campbell
2                dylan.campbell
2                dylanadmin
1                scott.noffsinger
1                Scott Noffsinger

This should get you started; depending on the amount & kind of data you are dealing with, you will need to be very selective about what algorithms you use. Do your homework and test, test ,test!

Let me know if you have questions.

Fuzzy lookup in SQL to match names

3 Answers3