Finding Duplicate Rows in SQL Server Based on Character Matching

Question

I'd like to find duplicate rows in a table based on matching two conditions. First condition, match on the data in the LastName field. Second condition, match on just the first three characters of data in the FirstName field.

For example, these two rows should be selected:

LastName       FirstName

 Williams       Robert

 Williams       Robbie

I attempted to build on RedFilter's solution: Finding duplicate rows in SQL Server but getting 0 records returned. Here's my query;

SELECT a.ObjGUID, a.LastName, a.FirstName, a.EmailAddress, ac.duplicateCount
FROM [Users].[dbo].[Known_Dupes_4] a
INNER JOIN (
SELECT
 LastName, FirstName, COUNT(*) AS duplicateCount
FROM [Users].[dbo].[Known_Dupes_4]
GROUP BY
 LastName, FirstName
HAVING 
 COUNT(*) > 1
) ac ON (UPPER(a.LastName) = UPPER(LTRIM(RTRIM(ac.LastName))) AND LEFT (UPPER(LTRIM(RTRIM(a.FirstName))),3) LIKE LEFT (UPPER(LTRIM(RTRIM(ac.FirstName))),3))
ORDER BY
 a.LastName, a.FirstName

Thanks

updated - this appears to work

SELECT a.EmployeeID, a.LastName, a.FirstName, a.EmailAddress, ac.duplicateCount
FROM [Users].[dbo].[Known_Dupes_4] a
INNER JOIN (
SELECT
LEFT(FirstName,3) as firstNameShort, COUNT(*) AS duplicateCount, LastName
FROM [Users].[dbo].[Known_Dupes_4]
GROUP BY
LastName, LEFT(FirstName,3)
HAVING 
COUNT(*) > 1
) ac ON (a.LastName) = ac.LastName AND LEFT (a.FirstName,3) LIKE LEFT (ac.firstNameShort,3)
ORDER BY a.LastName, a.FirstName

Do you want to match only if both those conditions are true. — George Joseph, Nov 30 '18 at 17:50
@George - Yes, looking to match only when both conditions are true. — NRT8, Nov 30 '18 at 19:26
After removing the unnecessary case changes and trimming (since working within a single table-column) and editing for the Group BY as @PiotrS, it appears to be working. — NRT8, Nov 30 '18 at 19:28

Tim Mylott · Answer 1 · 2018-11-30T22:30:32.817

0

One option would be to just get a COUNT() and partition it by LastName and the first 3 characters of FirstName and then filter.

Have a look at this:

DECLARE @TestData TABLE
    (
        [FirstName] NVARCHAR(100)
      , [LastName] NVARCHAR(100)
    );

INSERT INTO @TestData (
                          [FirstName]
                        , [LastName]
                      )
VALUES ( ' Robert ', 'Williams' )
     , ( 'Robbie', 'Williams ' )
     , ( 'Robin', ' Williams ' )
     , ( ' Rodger', 'Williams' ); --This one doesn't get returned

SELECT *
FROM   (
           SELECT *
                , COUNT(*) OVER ( PARTITION BY LTRIM(RTRIM([LastName]))
                                             , SUBSTRING(
                                                            LTRIM(RTRIM([FirstName]))
                                                          , 1
                                                          , 3
                                                        )
                                ) AS [DupNameCount]
           FROM   @TestData
       ) AS [Dup]
WHERE  [Dup].[DupNameCount] > 1;

edited Nov 30 '18 at 22:30

answered Nov 30 '18 at 17:51

Tim Mylott

2,553
1
5
11

Nice. This appears to work after I changed the SUBSTRING parameter to "4". – NRT8 Nov 30 '18 at 18:40
This solution isn't picking up all the duplicates for some reason. Please see my updated code. – NRT8 Nov 30 '18 at 19:36
@NRT8 is there spaces around any of your data in FirstName or LastName? That could throw it off and I can easily update to accommodate. – Tim Mylott Nov 30 '18 at 19:45
I checked for spaces but didn't find any. I think because it's joining on the same data, spaces and case aren't a factor. – NRT8 Nov 30 '18 at 21:30
@NRT8 not sure without setting a larger sample of your data. Your first comment about having to change the substring to 4 seemed to indicate to me that your data includes leading spaces. Either way, I updated my answer to deal with that anyways. – Tim Mylott Nov 30 '18 at 22:33
you're right about the Substringto 4. I was testing against two separate data sets - turns out one set had leading spaces. I'll continue to look for a pattern in the records not being returned. If I find one, I'll update. – NRT8 Dec 03 '18 at 16:31

PiotrS · Accepted Answer · 2018-11-30T22:52:47.273

0

You can group by LEFT(FirstName, 3) , for example:

    declare @t table (firstName nvarchar(20), lastname nvarchar(20))

    insert into @t
    values ('Robert', 'Williams'), ('Robbie', 'Williams'), ('NotRob', 'Williams'),  ('Steve', 'Other'), ('Steven', 'Other'), ('Someone', 'Else'), ('Roberto', 'Williams')

    select t1.* from @t t1
    cross apply (
            select
                LEFT(firstName, 3) as firstNameShort, lastname
            from
                @t t2
            where LEFT(t2.firstName, 3) = LEFT(t1.firstName, 3)
                and t2.lastname = t1.lastname
            group by
                lastname, LEFT(firstName, 3) 
            having 
                COUNT(*) > 1) t3
    order by t1.lastname, t1.firstName

edited Nov 30 '18 at 22:52

answered Nov 30 '18 at 17:53

PiotrS

180
3
16

good to know but this solution returns "Rob Williams". I'm needing "Robert Williams" and "Robbie Williams" returned. – NRT8 Nov 30 '18 at 18:26
@NRT8: Yes, I was giving you a hint rather than complete solution. I have now edited the answer which produces the results you want – PiotrS Nov 30 '18 at 22:48
That hint was just what I needed to fix my code - Thanks – NRT8 Dec 03 '18 at 16:19

score 0 · Answer 3 · answered Nov 30 '18 at 17:56

Here's how I use CTE and SUM() over() to get only the duplicates.

create table #example
(
    LastName varchar(25)      
   ,FirstName varchar(25)
)

INSERT INTO #example
VALUES
     ('Williams','Robert')
    ,('Williams','Robbie')
    ,('Jader','Arruda')
    ,('Jader','Gabriel')

WITH CTE 
AS
(
    SELECT FirstName, LastName,LEFT(FirstName,3) AS First_3_Letter_Name, SUM(1) 
    OVER(PARTITION BY LastName,LEFT(FirstName,3) ORDER BY  
    LastName,LEFT(FirstName,3)) AS ID
    FROM #example
)
SELECT * FROM CTE AS a
WHERE ID > 1;

Finding Duplicate Rows in SQL Server Based on Character Matching

3 Answers3