use Soundex and Substring to be able to get potential duplicate values from DB

Question

I asked a similar question [Querying database to find potential duplicates based on multiple columns and it was kindly answered by master of SQL Gordon Linoff.

All in all I have about 100k client records, only to find out now that there's potentially a lot of duplicate records - in most cases, first and last name are misspelled.

Here's the query that Gordon wrote for me...

select t
from (select t.*,
         count(*) over (partition by soundex(fname), soundex(lname), dob) as cnt
  from t
 ) t
where cnt > 1
order by dob, fname, lname;

And it works great - however it only seems to pull up very close matches. I was trying to update it to only try to match up the first three letters, but it doesn't seem to be doing a good job, it gives me a lot more records with the same DOB, however, the first/last name values are not only the first 3 letters, but a mix match basically.

So here's what I tried to do in the count(*) line

 count(*) over (partition by substring(soundex(fname),0,3), substring(soundex(lname),0,3), dob) as cnt

but like I said it doesn't pull up exact 3 character matches, so I get first names such as

Ana
Annette

Is there any way that I can update it to be able to dig down a bit deeper in finding my duplicates but only use the first 3 characters of first and lastname, while still using Soundex? Or potentially not use Soundex?

For an alternative approach, see https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings — rd_nielsen, Jul 27 '17 at 14:53
There is a SQL function, created years ago by Keith Henry, called DoubleMetaPhone. It works far better than SOUNDEX. Unfortunately, the code is too many characters to post here, but a Google search will fine it pretty quickly. — Jason A. Long, Jul 27 '17 at 22:34

score 0 · Answer 1 · answered Jul 28 '17 at 10:37

check this one

;WITH cte(fn, ln, dob)
     AS (SELECT 'anna',
                'elizibeth',
                '2000-12-19'
         UNION ALL
         SELECT 'ann',
                'elizibeth',
                '2000-12-19'
         UNION ALL
         SELECT 'paul',
                'eliot',
                '2000-12-13'
         UNION ALL
         SELECT 'rindol',
                'eliot',
                '2000-12-13')
SELECT fn,ln,count(*)           
FROM   (SELECT Substring(fn, 1, 3) fn,
               Substring(ln, 1, 3) ln,
               dob
        FROM   cte)a 
group by  fn,ln

use Soundex and Substring to be able to get potential duplicate values from DB

1 Answers1