0

I asked a similar question [Querying database to find potential duplicates based on multiple columns and it was kindly answered by master of SQL Gordon Linoff.

All in all I have about 100k client records, only to find out now that there's potentially a lot of duplicate records - in most cases, first and last name are misspelled.

Here's the query that Gordon wrote for me...

select t
from (select t.*,
         count(*) over (partition by soundex(fname), soundex(lname), dob) as cnt
  from t
 ) t
where cnt > 1
order by dob, fname, lname; 

And it works great - however it only seems to pull up very close matches. I was trying to update it to only try to match up the first three letters, but it doesn't seem to be doing a good job, it gives me a lot more records with the same DOB, however, the first/last name values are not only the first 3 letters, but a mix match basically.

So here's what I tried to do in the count(*) line

 count(*) over (partition by substring(soundex(fname),0,3), substring(soundex(lname),0,3), dob) as cnt

but like I said it doesn't pull up exact 3 character matches, so I get first names such as

Ana
Annette

Is there any way that I can update it to be able to dig down a bit deeper in finding my duplicates but only use the first 3 characters of first and lastname, while still using Soundex? Or potentially not use Soundex?

BobSki
  • 1,531
  • 2
  • 25
  • 61
  • For an alternative approach, see https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings – rd_nielsen Jul 27 '17 at 14:53
  • @rd_nielsen - thanks – BobSki Jul 27 '17 at 14:57
  • 1
    There is a SQL function, created years ago by Keith Henry, called DoubleMetaPhone. It works far better than SOUNDEX. Unfortunately, the code is too many characters to post here, but a Google search will fine it pretty quickly. – Jason A. Long Jul 27 '17 at 22:34
  • is it possible to share some sample data? – Smart003 Jul 28 '17 at 08:50

1 Answers1

0

check this one

;WITH cte(fn, ln, dob)
     AS (SELECT 'anna',
                'elizibeth',
                '2000-12-19'
         UNION ALL
         SELECT 'ann',
                'elizibeth',
                '2000-12-19'
         UNION ALL
         SELECT 'paul',
                'eliot',
                '2000-12-13'
         UNION ALL
         SELECT 'rindol',
                'eliot',
                '2000-12-13')
SELECT fn,ln,count(*)           
FROM   (SELECT Substring(fn, 1, 3) fn,
               Substring(ln, 1, 3) ln,
               dob
        FROM   cte)a 
group by  fn,ln
Smart003
  • 1,119
  • 2
  • 16
  • 31