Assuming you copy and paste the jaro-winkler implementation from SSC (registration required), the following code will work. I tried to build a SQLFiddle for it but it kept going belly up when I was building the schema.
This implementation has a cheat---I'm using a cursor. Generally, cursors are not conducive to performance but in this case, you need to be able to compare the set against itself. There's probably a graceful number/tally table approach to eliminate the declared cursor.
DECLARE @SRC TABLE
(
source_string varchar(50) NOT NULL
, ref_id int identity(1,1) NOT NULL
);
-- Identify matches
DECLARE @WORK TABLE
(
source_ref_id int NOT NULL
, match_ref_id int NOT NULL
);
INSERT INTO
@src
SELECT 'Jon Q'
UNION ALL SELECT 'John Q'
UNION ALL SELECT 'JOHN Q'
UNION ALL SELECT 'Jonn Q'
-- Oops on matching joan to jon
UNION ALL SELECT 'Joan Q'
UNION ALL SELECT 'june'
UNION ALL SELECT 'Mary W'
UNION ALL SELECT 'Marie W'
UNION ALL SELECT 'Matt H';
-- 2 problems to address
-- duplicates in our inbound set
-- duplicates against a reference set
--
-- Better matching will occur if names are split into ordinal entities
-- Splitting on whitespace is always questionable
--
-- Mat, Matt, Matthew
DECLARE CSR CURSOR
READ_ONLY
FOR
SELECT DISTINCT
S1.source_string
, S1.ref_id
FROM
@SRC AS S1
ORDER BY
S1.ref_id;
DECLARE @source_string varchar(50), @ref_id int
OPEN CSR
FETCH NEXT FROM CSR INTO @source_string, @ref_id
WHILE (@@fetch_status <> -1)
BEGIN
IF (@@fetch_status <> -2)
BEGIN
IF NOT EXISTS
(
SELECT * FROM @WORK W WHERE W.match_ref_id = @ref_id
)
BEGIN
INSERT INTO
@WORK
SELECT
@ref_id
, S.ref_id
FROM
@src S
-- If we have already matched the value, skip it
LEFT OUTER JOIN
@WORK W
ON W.match_ref_id = S.ref_id
WHERE
-- Don't match yourself
S.ref_id <> @ref_id
-- arbitrary threshold, will need to examine this for sanity
AND dbo.fn_calculateJaroWinkler(@source_string, S.source_string) > .95
END
END
FETCH NEXT FROM CSR INTO @source_string, @ref_id
END
CLOSE CSR
DEALLOCATE CSR
-- Show me the list of all the unmatched rows
-- plus the retained
;WITH MATCHES AS
(
SELECT
S1.source_string
, S1.ref_id
, S2.source_string AS match_source_string
, S2.ref_id AS match_ref_id
FROM
@SRC S1
INNER JOIN
@WORK W
ON W.source_ref_id = S1.ref_id
INNER JOIN
@SRC S2
ON S2.ref_id = W.match_ref_id
)
, UNMATCHES AS
(
SELECT
S1.source_string
, S1.ref_id
, NULL AS match_source_string
, NULL AS match_ref_id
FROM
@SRC S1
LEFT OUTER JOIN
@WORK W
ON W.source_ref_id = S1.ref_id
LEFT OUTER JOIN
@WORK S2
ON S2.match_ref_id = S1.ref_id
WHERE
W.source_ref_id IS NULL
and s2.match_ref_id IS NULL
)
SELECT
M.source_string
, M.ref_id
, M.match_source_string
, M.match_ref_id
FROM
MATCHES M
UNION ALL
SELECT
M.source_string
, M.ref_id
, M.match_source_string
, M.match_ref_id
FROM
UNMATCHES M;
-- To specifically solve your request
SELECT
S.source_string AS Name
, COALESCE(S2.source_string, S.source_string) As StdName
FROM
@SRC S
LEFT OUTER JOIN
@WORK W
ON W.match_ref_id = S.ref_id
LEFT OUTER JOIN
@SRC S2
ON S2.ref_id = W.source_ref_id
query output 1
source_string ref_id match_source_string match_ref_id
Jon Q 1 John Q 2
Jon Q 1 JOHN Q 3
Jon Q 1 Jonn Q 4
Jon Q 1 Joan Q 5
june 6 NULL NULL
Mary W 7 NULL NULL
Marie W 8 NULL NULL
Matt H 9 NULL NULL
query output 2
Name StdName
Jon Q Jon Q
John Q Jon Q
JOHN Q Jon Q
Jonn Q Jon Q
Joan Q Jon Q
june june
Mary W Mary W
Marie W Marie W
Matt H Matt H
There be dragons
Over on SuperUser, I talked about my experience matching people. In this section, I'll list some things to be aware of.
Speed
As part of your matching, hooray in that you have a birthday to augment the match process. I would actually propose you generate a match based exclusively on birthdate first. That is an exact match and one that, with a proper index, SQL Server will be able to quickly include/exclude rows. Because you're going to need it. The TSQL implementation is dog slow. I've been running the equivalent match against a dataset of 28k names (names that had been listed as conference attendees). There ought to be some good overlap there and while I did fill @src with data, it is a table variable with all that that implies but it's been running now for 15 minutes and still hasn't completed.
It's slow for a number of reasons but things that jumped out at me are all the looping and string manipulation in the functions. That is not where SQL Server shines. If you have a need to do a lot of this, it might be a good idea to convert them into CLR methods so at least you can leverage the strength of the .NET libraries for some of the manipulations.
One of the matches we used to use was the Double Metaphone and it would generate a pair of possible phonetic interpretations of the name. Instead of computing that every time, compute it once and store it alongside the name. That would help speed some of the matching. Unfortunately, it doesn't look like JW lends itself to breaking it down like that.
Look at iterating too. We'd first try the algs that we knew were fast. 'John' = 'John' so there's no need to pull out the big guns so we'd try a first pass of straight name checks. If we didn't find a match, we'd try harder. The hope was that by taking various swipes at matching we'd get the low hanging fruit as fast as possible and worry about the harder matches later.
Names
In my SU answer and in the code comments, I mention nicknames. Bill and Billy are going to match. Billy, Liam and William are definitely not going to match even though they may be the same person. You might want to look at a list like this to provide translation between nickname and full name. After running a set of matches on the supplied name, maybe we'd try looking for a match based on the possible root name.
Obviously, there are draw backs to this approach. For example, my grandfather-in-law is Max. Just Max. Not Maximilian, Maximus or any other things you might thing.
Your supplied names look like it's first and last concatenated together. Future readers, if you ever have the opportunity to capture individual portions of a name, please do so. There are products out there that will split names and try to match them up against directories to try and guess whether something is first/middle name or a surname but then you have people like "Robar Mike". If you saw that name there, you'd think Robar is a last name and you'd also pronounce it like "robber." Instead, Robar (say it with a French accent) is his first name and Mike is his last name. At any rate, I think you'll have a better matching experience if you can split first and last out into separate fields and match the individual pieces together. An exact last name match plus a partial first name match might suffice, especially in cases where legally they are "Franklin Roosevelt" and you have a candidate of "F. Roosevelt" Perhaps you have a rule that an initial letter can match. Or you don't.
Noise - as referenced in the JW post and my answer, strip out crap (punctuation, stop words, etc) for matching purposes. Also watch out for honorific tites (phd, jd, etc) and generationals (II, III, JR, SR). Our rule was a candidate with/without a generational could match one in the opposite state (Bob Jones Jr == Bob Jones) or could exactly match the generation (Bob Jones Sr = Bob Jones Sr) but you'd never want to match if both records supplied them and they were conflicting (Bob Jones Sr != Bob Jones Jr).
Case sensitivity, always check your database and tempdb to make sure you aren't making case sensitive matches. And if you are, convert everything to upper or lower for purposes of matching but don't ever throw the supplied casing away. Good luck trying to determine whether latessa should be Latessa, LaTessa or something else.
My query is coming up on a hour's worth of processing with no rows returned so I'm going to kill it and turn in. Best of luck, happy matching.