1

I have a table in my database where I have about 100k clients and with a erroneous system that was in place for quite some time we are just coming to find out that we might potentially have thousands of duplicate records, that should not have been duplicates.

This is the basic structure of the table...

ClientID         FName           LName           DOB          SysEMID
----------------------------------------------------------------------
123              Janette         Marukan         1/2/1990     111111S
478              Jannete         Maruckan        1/2/1990     111111S

These are not all fields in the system but what my goal is to be able to use the columns above in my search query. So I'm trying to take the first 2 or 3 letters of First and Last names, DOB as well as the SysEMID and find all records that could be potential duplicates.

As can be seen above in the table I have two clients, but their names are slightly different - in fact the two clients are both the same, but they are saved in the system as two. Is there a way for me to somehow query the whole tables and find that?

I really don't have 1 specific client in place - it's more of a general kind of query to show me potential duplicates of clients with the same 2 or 3 letters of 1st Name, Last Name and DOB for starters?

BobSki
  • 1,531
  • 2
  • 25
  • 61

4 Answers4

2

You would use window functions:

select t
from (select t.*,
             count(*) over (partition by left(fname, 2), left(lname, 2), dob, sysemid) as cnt
      from t
     ) t
where cnt > 1
order by sysemid, dob, fname, lname;

This may also be one of those rare situations where soundex() is actually useful. After all, it was designed with strings in mind. So:

select t
from (select t.*,
             count(*) over (partition by soundex(fname), soundex(lname), dob, sysemid) as cnt
      from t
     ) t
where cnt > 1
order by sysemid, dob, fname, lname;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Any way I can basically search by first 2 or 3 letters of first and last names, and DOB, have them be organized by first and last name, this way I can jsut copy the data into excel and go from there. Only showing me >1 records requires me to investigate each DOB but If i were able to see all of them, this way I can skip through and know which ones are duplicates and which ones are not. Long shot I guess. Thanks. – BobSki Jul 26 '17 at 19:34
1

I guess you have sovled the issue for new data but for old data what you can do is the following:

1) Make exact comparison that will get you the very exact duplicates. This is the easy part.

2) For Similar strings, for example Josh and Joshh you need either to use some code that you need to write to check for distance between two strings or a simpler way if you like coding in SQL only is to use DIFFERENCE function. this will give you a value of 0-4 where 4 means the strings are very similar. The bad thing about this is you need to give them a quick look by eye to make sure everything is good. For more info https://technet.microsoft.com/en-us/library/ms189282(v=sql.105).aspx

3) I prefer you go, open Visual studio or your preferred IDE, connect to your database , Select the whole data in the table, bring it in memory and implement a good string distance function.

Omar Zaarour
  • 302
  • 3
  • 9
  • It doesn't have to be exact I guess what would work if I could somehow use the DOB as my main identifier, since that will cut down the list significantly. Basically looking to find similar substrings of first and last names for repeating DOBs. Someone just needs this for reference, basically to see if a client might have all the same information but a different letter in the first name as shown in the example above. – BobSki Jul 25 '17 at 15:24
1

Like this:

SELECT LEFT(FName,2) AS FN, LEFT(LName,2) AS LN, DOB, COUNT(*)
FROM YourTable
GROUP BY LEFT(FName,2), LEFT(LName,2), DOB
HAVING COUNT(*) > 1
;
RBarryYoung
  • 55,398
  • 14
  • 96
  • 137
0

There is no magic bullet built into SQL Server. You will have to decide on your business rules and code them.

If you decide to look for matches in the first 3 letters of a name, for instance, use the LEFT() function to get the first 3 letters. To look at individual parts of the DOB, you can use the DATEPART() function.

Tab Alleman
  • 31,483
  • 7
  • 36
  • 52