I need to modify a SQL table to group slightly mismatched names, and assign all elements in the group a standardized name.

For instance, if the initial table looks like this:

Name
--------
Jon Q
John Q
Jonn Q
Mary W
Marie W
Matt H

I would like to create a new table or add a field to the existing one like this:

Name     | StdName
--------------------
Jon Q    | Jon Q
John Q   | Jon Q
Jonn Q   | Jon Q
Mary W   | Mary W
Marie W  | Mary W
Matt H   | Matt H

In this case, I've chosen the first name to assign as the "standardized name," but I don't actually care which one is chosen -- ultimately the final "standardized name" will be hashed into a unique person ID. (I'm also open to alternative solutions that go directly to a numerical ID.) I will have birthdates to match on as well, so the accuracy of the name matching doesn't actually need to be all that precise in practice. I've looked into this a bit and will probably use the Jaro-Winkler algorithm (see e.g. here).
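
(For instance, once a StdName column exists, something like DENSE_RANK() would presumably give a numeric ID directly, without hashing; a rough sketch only, where dbo.Names is just a placeholder table name.)

-- Sketch: every row sharing a standardized name gets the same integer,
-- which could serve as the person ID.
SELECT
    Name
,   StdName
,   DENSE_RANK() OVER (ORDER BY StdName) AS PersonID
FROM dbo.Names;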

If I knew that the names were all in pairs, this would be a relatively easy query, but there can be an arbitrary number of the same name.

I can easily conceptualize how to do this query in a procedural language, but I'm not very familiar with SQL. Unfortunately I don't have direct access to the data -- it's sensitive data and so somebody else (a bureaucrat) has to run the actual query for me. The specific implementation will be SQL Server, but I'd prefer an implementation-agnostic solution.

EDIT:

In response to a comment, I had the following procedural approach in mind. It's in Python, and I replaced the Jaro-Winkler with simply matching on the first letter of the name, for the sake of having a working code example.

nameList = ['Jon Q', 'John Q', 'Jonn Q', 'Mary W', 'Marie W', 'Larry H']
stdList = nameList[:]

# loop over all names
for i1, name1 in enumerate(stdList):

  # loop over later names in list to find matches
  for i2, name2 in enumerate(stdList[i1+1:]):

    # If there's a match, replace latter with former.
    if (name1[0] == name2[0]):
      stdList[i1+1+i2] = name1

print(stdList)

The result is ['Jon Q', 'Jon Q', 'Jon Q', 'Mary W', 'Mary W', 'Larry H'].

abeboparebop
  • So give us that procedural code and we'll try to help you from there. – phadaphunk May 08 '13 at 19:10
  • I see there's a [Jaro-Winkler](http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/) TSQL implementation over on SSC. We ended up implementing our matching logic in CLR methods but it might not hurt to try the TSQL approach and see how it scales. – billinkc May 08 '13 at 19:30
  • @billinkc I saw that -- that was the function implementation I was planning to use. However, it's still not clear to me how to match the names in _groups_ rather than in _pairs_, so that I can assign a single standardized name/ID to the group. – abeboparebop May 08 '13 at 21:44
  • SQL Server 2012 or previous edition? – billinkc May 09 '13 at 01:37

2 Answers

Just a thought, but you might be able to use the SOUNDEX() function. It generates the same code for names that sound alike.

If you started with something like this:

select name, soundex(name) snd,
  row_number() over(partition by soundex(name)
                    order by soundex(name)) rn
from yt;

See SQL Fiddle with Demo. This gives the SOUNDEX code for each row along with a row_number() within each group of similar names, so you can return only the first value for each group. For example, the above query will return:

|    NAME |  SND | RN |
-----------------------
|   Jon Q | J500 |  1 |
|  John Q | J500 |  2 |
|  Jonn Q | J500 |  3 |
|  Matt H | M300 |  1 |
|  Mary W | M600 |  1 |
| Marie W | M600 |  2 |

Then you could select all of the rows from this result where the row_number() is equal to 1 and then join back to your main table on the soundex(name) value:

select t1.name,
  t2.Stdname
from yt t1
inner join
(
  select name as stdName, snd, rn
  from
  (
    select name, soundex(name) snd,
      row_number() over(partition by soundex(name)
                        order by soundex(name)) rn
    from yt
  ) d
  where rn = 1
) t2
  on soundex(t1.name) = t2.snd;

See SQL Fiddle with Demo. This gives a result:

|    NAME | STDNAME |
---------------------
|   Jon Q |   Jon Q |
|  John Q |   Jon Q |
|  Jonn Q |   Jon Q |
|  Mary W |  Mary W |
| Marie W |  Mary W |
|  Matt H |  Matt H |
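
A related built-in worth knowing: SQL Server's DIFFERENCE() compares the SOUNDEX() values of two strings and returns 0-4 (4 being the closest), which can be handy if you only need a similarity score rather than the codes themselves. A minimal sketch of a pairwise check against the same yt table:

-- Sketch: score each unordered pair of names by SOUNDEX similarity.
SELECT
    t1.name
,   t2.name AS candidate
,   DIFFERENCE(t1.name, t2.name) AS similarity
FROM yt t1
INNER JOIN yt t2
    ON t1.name < t2.name   -- each pair once, never self
WHERE DIFFERENCE(t1.name, t2.name) >= 3;
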
Taryn
  • There are better options than SOUNDEX. I talked about it a bit on this [answer](http://superuser.com/questions/480133/record-matching-software-to-compare-two-tables-and-match-on-based/481592#481592) – billinkc May 08 '13 at 19:22
  • @billinkc I am sure there are, this is what I thought of right off the bat. :) But you got an upvote from me on your other answer. – Taryn May 08 '13 at 19:25
  • Soundex seems like it could work for me, except that it would have difficulty handling typos with e.g. transposed consonants. Given that possibility, I think I need to compare all of the names one-to-one, rather than "hashing" them into a soundex code and comparing those afterwards. – abeboparebop May 08 '13 at 21:40

Assuming you copy and paste the Jaro-Winkler implementation from SSC (registration required), the following code will work. I tried to build a SQL Fiddle for it, but it kept going belly up while I was building the schema.

This implementation has a cheat -- I'm using a cursor. Generally, cursors are not conducive to performance, but in this case you need to be able to compare the set against itself. There's probably a graceful numbers/tally-table approach that would eliminate the declared cursor (a set-based sketch appears after the query output below).

DECLARE @SRC TABLE
(
    source_string varchar(50) NOT NULL
,   ref_id int identity(1,1) NOT NULL
);

-- Identify matches
DECLARE @WORK TABLE
(
    source_ref_id int NOT NULL
,   match_ref_id int NOT NULL
);

INSERT INTO
    @src
SELECT 'Jon Q'
UNION ALL SELECT 'John Q'
UNION ALL SELECT 'JOHN Q'
UNION ALL SELECT 'Jonn Q'
-- Oops on matching joan to jon
UNION ALL SELECT 'Joan Q'
UNION ALL SELECT 'june'
UNION ALL SELECT 'Mary W'
UNION ALL SELECT 'Marie W'
UNION ALL SELECT 'Matt H';

-- 2 problems to address
-- duplicates in our inbound set
-- duplicates against a reference set
--
-- Better matching will occur if names are split into ordinal entities
-- Splitting on whitespace is always questionable
--
-- Mat, Matt, Matthew 

DECLARE CSR CURSOR
READ_ONLY
FOR 
SELECT DISTINCT
    S1.source_string
,   S1.ref_id
FROM
    @SRC AS S1
ORDER BY
    S1.ref_id;

DECLARE @source_string varchar(50), @ref_id int
OPEN CSR

FETCH NEXT FROM CSR INTO @source_string, @ref_id
WHILE (@@fetch_status <> -1)
BEGIN
    IF (@@fetch_status <> -2)
    BEGIN
        IF NOT EXISTS
        (
            SELECT * FROM @WORK W WHERE W.match_ref_id = @ref_id
        )
        BEGIN
            INSERT INTO
                @WORK
            SELECT
                @ref_id
            ,   S.ref_id
            FROM
                @src S
                -- If we have already matched the value, skip it
                LEFT OUTER JOIN
                    @WORK W
                    ON W.match_ref_id = S.ref_id
            WHERE
                -- Don't match yourself
                S.ref_id <> @ref_id
                -- arbitrary threshold, will need to examine this for sanity
                AND dbo.fn_calculateJaroWinkler(@source_string, S.source_string) > .95
        END
    END
    FETCH NEXT FROM CSR INTO @source_string, @ref_id
END

CLOSE CSR

DEALLOCATE CSR

-- Show me the list of all the unmatched rows 
-- plus the retained

;WITH MATCHES AS
(
    SELECT 
        S1.source_string
    ,   S1.ref_id
    ,   S2.source_string AS match_source_string
    ,   S2.ref_id AS match_ref_id
    FROM 
        @SRC S1
        INNER JOIN
            @WORK W
            ON W.source_ref_id = S1.ref_id
        INNER JOIN
            @SRC S2
            ON S2.ref_id = W.match_ref_id
)
, UNMATCHES AS
(
    SELECT 
        S1.source_string
    ,   S1.ref_id
    ,   NULL AS match_source_string
    ,   NULL AS match_ref_id
    FROM 
        @SRC S1
        LEFT OUTER JOIN
            @WORK W
            ON W.source_ref_id = S1.ref_id
        LEFT OUTER JOIN
            @WORK S2
            ON S2.match_ref_id = S1.ref_id
    WHERE
        W.source_ref_id IS NULL
        and s2.match_ref_id IS NULL
)
SELECT
    M.source_string
,   M.ref_id
,   M.match_source_string
,   M.match_ref_id
FROM
    MATCHES M
UNION ALL
SELECT
    M.source_string
,   M.ref_id
,   M.match_source_string
,   M.match_ref_id
FROM
    UNMATCHES M;

-- To specifically solve your request

SELECT
    S.source_string AS Name
,   COALESCE(S2.source_string, S.source_string) As StdName
FROM
    @SRC S
    LEFT OUTER JOIN
        @WORK W
        ON W.match_ref_id = S.ref_id
    LEFT OUTER JOIN
        @SRC S2
        ON S2.ref_id = W.source_ref_id

Query output 1

source_string   ref_id   match_source_string   match_ref_id
-------------   ------   -------------------   ------------
Jon Q           1        John Q                2
Jon Q           1        JOHN Q                3
Jon Q           1        Jonn Q                4
Jon Q           1        Joan Q                5
june            6        NULL                  NULL
Mary W          7        NULL                  NULL
Marie W         8        NULL                  NULL
Matt H          9        NULL                  NULL

Query output 2

Name    StdName
Jon Q   Jon Q
John Q  Jon Q
JOHN Q  Jon Q
Jonn Q  Jon Q
Joan Q  Jon Q
june    june
Mary W  Mary W
Marie W Marie W
Matt H  Matt H
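
As mentioned above, there is probably a set-based alternative to the declared cursor. A minimal, untested sketch of that idea (it reuses @SRC and dbo.fn_calculateJaroWinkler from the script above, so it would have to run in the same batch; note it is a single pass, not the full transitive grouping the cursor produces):

-- Sketch: for each name, take the lowest ref_id among all names that clear
-- the similarity threshold (including itself) and use that row's name as
-- the standardized name.
SELECT
    S1.source_string AS Name
,   S3.source_string AS StdName
FROM
    @SRC AS S1
    CROSS APPLY
    (
        SELECT MIN(S2.ref_id) AS std_ref_id
        FROM @SRC AS S2
        WHERE S2.ref_id = S1.ref_id
           OR dbo.fn_calculateJaroWinkler(S1.source_string, S2.source_string) > .95
    ) AS grp
    INNER JOIN @SRC AS S3
        ON S3.ref_id = grp.std_ref_id;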

There be dragons

Over on SuperUser, I talked about my experience matching people. In this section, I'll list some things to be aware of.

Speed

As part of your matching, hooray that you have a birthdate to augment the match process. I would actually propose you generate a match based exclusively on birthdate first. That is an exact match, and one that, with a proper index, SQL Server can use to quickly include/exclude rows. You're going to need it, because the TSQL implementation is dog slow. I've been running the equivalent match against a dataset of 28k names (names that had been listed as conference attendees). There ought to be some good overlap there, and while I did fill @src with data, it is a table variable with all that that implies; it's been running for 15 minutes now and still hasn't completed.
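
To make that concrete, a minimal sketch of the blocking idea (dbo.people, ref_id, full_name and birth_date are placeholder names, since I don't know your schema):

-- Sketch: only rows that share an exact birthdate are passed to the
-- expensive string comparison, so an index on birth_date does most of
-- the pruning.
SELECT
    p1.ref_id
,   p2.ref_id AS candidate_ref_id
,   dbo.fn_calculateJaroWinkler(p1.full_name, p2.full_name) AS similarity
FROM dbo.people AS p1
INNER JOIN dbo.people AS p2
    ON  p2.birth_date = p1.birth_date   -- exact, index-friendly match first
    AND p2.ref_id > p1.ref_id           -- each pair once, never self
WHERE dbo.fn_calculateJaroWinkler(p1.full_name, p2.full_name) > .95;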

It's slow for a number of reasons but things that jumped out at me are all the looping and string manipulation in the functions. That is not where SQL Server shines. If you have a need to do a lot of this, it might be a good idea to convert them into CLR methods so at least you can leverage the strength of the .NET libraries for some of the manipulations.
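
For reference, wiring up a CLR implementation is mostly boilerplate once the assembly is written; a minimal sketch with placeholder assembly, namespace and method names:

-- Sketch: register a hypothetical .NET assembly and expose one of its
-- static methods as a scalar T-SQL function.
-- (Assumes CLR integration is enabled: sp_configure 'clr enabled', 1; RECONFIGURE;)
CREATE ASSEMBLY FuzzyMatching
FROM 'C:\assemblies\FuzzyMatching.dll'   -- placeholder path
WITH PERMISSION_SET = SAFE;
GO
CREATE FUNCTION dbo.clr_JaroWinkler (@s1 nvarchar(4000), @s2 nvarchar(4000))
RETURNS float
AS EXTERNAL NAME FuzzyMatching.[FuzzyMatching.Similarity].JaroWinkler;
GO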

One of the matches we used to use was the Double Metaphone and it would generate a pair of possible phonetic interpretations of the name. Instead of computing that every time, compute it once and store it alongside the name. That would help speed some of the matching. Unfortunately, it doesn't look like JW lends itself to breaking it down like that.
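
A minimal sketch of the precompute idea (dbo.DoubleMetaphone is a stand-in for whatever phonetic function you actually have available, and the table/column names are placeholders):

-- Sketch: store the phonetic code once, next to the name, instead of
-- recomputing it on every comparison.
ALTER TABLE dbo.people ADD name_metaphone varchar(10) NULL;

UPDATE dbo.people
SET name_metaphone = dbo.DoubleMetaphone(full_name);

CREATE INDEX IX_people_name_metaphone ON dbo.people (name_metaphone);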

Look at iterating, too. We'd first try the algorithms that we knew were fast: 'John' = 'John', so there's no need to pull out the big guns, and we'd start with a first pass of straight name checks. If we didn't find a match, we'd try harder. The hope was that by taking various swipes at matching we'd get the low-hanging fruit as fast as possible and worry about the harder matches later.
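
A minimal sketch of that tiered approach (dbo.people and dbo.matches are placeholder tables), where the cheap exact comparison runs first and only the leftovers pay for Jaro-Winkler:

-- Pass 1: exact name matches are cheap, take them first.
INSERT INTO dbo.matches (source_ref_id, match_ref_id, match_method)
SELECT p1.ref_id, p2.ref_id, 'exact'
FROM dbo.people AS p1
INNER JOIN dbo.people AS p2
    ON  p2.full_name = p1.full_name
    AND p2.ref_id > p1.ref_id;

-- Pass 2: only rows with no match yet get the expensive function.
INSERT INTO dbo.matches (source_ref_id, match_ref_id, match_method)
SELECT p1.ref_id, p2.ref_id, 'jaro-winkler'
FROM dbo.people AS p1
INNER JOIN dbo.people AS p2
    ON p2.ref_id > p1.ref_id
WHERE NOT EXISTS (SELECT 1 FROM dbo.matches M
                  WHERE M.source_ref_id IN (p1.ref_id, p2.ref_id)
                     OR M.match_ref_id  IN (p1.ref_id, p2.ref_id))
  AND dbo.fn_calculateJaroWinkler(p1.full_name, p2.full_name) > .95;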

Names

In my SU answer and in the code comments, I mention nicknames. Bill and Billy are going to match. Billy, Liam and William are definitely not going to match even though they may be the same person. You might want to look at a list like this to provide translation between nickname and full name. After running a set of matches on the supplied name, maybe we'd try looking for a match based on the possible root name.
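
A minimal sketch of that translation step, assuming a hypothetical dbo.nicknames lookup table loaded from such a list:

-- Sketch: resolve each first name to its root (Billy -> William) before
-- matching; names with no nickname entry fall through unchanged.
SELECT
    p.ref_id
,   p.first_name
,   COALESCE(n.root_name, p.first_name) AS first_name_for_matching
FROM dbo.people AS p
LEFT OUTER JOIN dbo.nicknames AS n
    ON n.nickname = p.first_name;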

Obviously, there are drawbacks to this approach. For example, my grandfather-in-law is Max. Just Max. Not Maximilian, Maximus or any other name you might think of.

Your supplied names look like first and last name concatenated together. Future readers, if you ever have the opportunity to capture individual portions of a name, please do so. There are products out there that will split names and try to match them against directories to guess whether something is a first/middle name or a surname, but then you have people like "Robar Mike". If you saw that name, you'd think Robar is a last name, and you'd also pronounce it like "robber." Instead, Robar (say it with a French accent) is his first name and Mike is his last name. At any rate, I think you'll have a better matching experience if you can split first and last out into separate fields and match the individual pieces together. An exact last name match plus a partial first name match might suffice, especially in cases where the legal name is "Franklin Roosevelt" and you have a candidate of "F. Roosevelt". Perhaps you have a rule that an initial letter can match. Or you don't.
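
A minimal sketch of that split-name rule (the first_name/last_name columns are assumptions): exact surname, plus either an exact first name or a lone-initial match, one direction only for brevity:

-- Sketch: exact last name, then exact first name OR a bare initial
-- ('F.'/'F' matching 'Franklin').
SELECT p1.ref_id, p2.ref_id AS candidate_ref_id
FROM dbo.people AS p1
INNER JOIN dbo.people AS p2
    ON  p2.last_name = p1.last_name
    AND p2.ref_id > p1.ref_id
WHERE p2.first_name = p1.first_name
   OR (LEN(REPLACE(p2.first_name, '.', '')) = 1
       AND LEFT(p2.first_name, 1) = LEFT(p1.first_name, 1));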

Noise: as referenced in the JW post and my answer, strip out crap (punctuation, stop words, etc.) for matching purposes. Also watch out for honorific titles (PhD, JD, etc.) and generationals (II, III, JR, SR). Our rule was that a candidate with/without a generational could match one in the opposite state (Bob Jones Jr == Bob Jones) or could exactly match the generational (Bob Jones Sr == Bob Jones Sr), but you'd never want to match if both records supplied them and they were conflicting (Bob Jones Sr != Bob Jones Jr).
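
A minimal sketch of that cleanup, done into a separate matching column so the supplied value is never altered (the REPLACE/CASE chain is only illustrative; a real version would cover more cases):

-- Sketch: strip punctuation and a trailing JR/SR before matching.
SELECT
    full_name
,   LTRIM(RTRIM(
        REPLACE(REPLACE(REPLACE(
            CASE
                WHEN full_name LIKE '% JR' OR full_name LIKE '% SR'
                    THEN LEFT(full_name, LEN(full_name) - 3)
                ELSE full_name
            END, '.', ''), ',', ''), '''', '')
    )) AS name_for_matching
FROM dbo.people;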

Case sensitivity: always check your database and tempdb to make sure you aren't making case-sensitive matches. And if you are, convert everything to upper or lower case for purposes of matching, but don't ever throw the supplied casing away. Good luck trying to determine whether latessa should be Latessa, LaTessa or something else.
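
A minimal sketch of that, comparing on an upper-cased copy (or via an explicit collation) while always returning the value exactly as supplied:

-- Sketch: case-insensitive comparison without losing the original casing.
SELECT p1.full_name, p2.full_name AS candidate
FROM dbo.people AS p1
INNER JOIN dbo.people AS p2
    ON  UPPER(p1.full_name) = UPPER(p2.full_name)
    AND p2.ref_id > p1.ref_id;
-- or: ON p1.full_name = p2.full_name COLLATE Latin1_General_CI_AS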

My query is coming up on an hour's worth of processing with no rows returned, so I'm going to kill it and turn in. Best of luck, and happy matching.

billinkc
  • I'll comment now to say that this is an outstanding answer. I'll give it a shot and plan to come back and accept in a couple of days -- working through an intermediary unfortunately so it'll be slow. Could you generalize to a case with an exact match on DOB? I'm not at all familiar with T-SQL or the control structure for the cursor. (Feel free to assume it's a string.) – abeboparebop May 09 '13 at 05:37
  • By the way, the database is rather larger than the one you tried: ~1.6M observations of ~20 variables. In the interim I've written up a Python solution (working on a csv file) that does what I want and may be less of a pain, though it's fairly memory-intensive. It will require a state government employee to install Python, however. :) – abeboparebop May 09 '13 at 05:39
  • @billinkc, Can you please help me out for this: https://stackoverflow.com/questions/57850385/dynamic-ssis-package-for-fuzzy-grouping – MAK Sep 10 '19 at 05:05