SQL remove duplicates - keeping a record included in the highest count

Question

I have a table with duplicate data similar to below example:

ID | ACCNO  | ACCNAME    | ADDRESS1     | ADDRESS2    | City
1  | 1001   | Joe B Ltd  | 123 Street1  |             | London
2  | 1001   | JoeB Ltd   | 123 Street1  |             | London
3  | 1001   | JoeB Ltd   | 123 Street1  |             | London
4  | 1001   | JoeB Ltd   | 123 Street1  | London      | London
5  | 1001   | JoeB Ltd   | 129 Street9  |             | London

ID is currently the unique primary key, however ACCNO should be when duplicates removed.

I've seen many queries to remove duplicate records such as https://stackoverflow.com/a/18719814/4949859

However I would like to choose which row to keep based on the count of duplicate rows. I believe that if I select a row from the grouped items with the highest count I'm most likely to get a correctly formatted address.

In my example using "NOT IN (SELECT MAX" or "MIN" will leave the wrong record in my case.

However when I use GROUP BY to get the highest count I can't include the ID field.

SELECT COUNT(ID), ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY FROM SUPPLIERS GROUP BY ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY ORDER BY COUNT(ID) DESC

This would give the result:

Count(ID) | ACCNO  | ACCNAME    | ADDRESS1     | ADDRESS2    | City
2         | 1001   | JoeB Ltd   | 123 Street1  |             | London
1         | 1001   | Joe B Ltd  | 123 Street1  |             | London
1         | 1001   | JoeB Ltd   | 123 Street1  | London      | London
1         | 1001   | JoeB Ltd   | 129 Street9  |             | London

Hope I'm making sense. I don't know how to return an ID (any) from a group where the count is highest. Does anybody else know how I might achieve this?

Edit:

I the above example grouping all columns except ID and getting a count, rows 2 and 3 would be grouped together giving a group count of 2 (the rest would be count ID of 1 as they are all unique) so I would want to keep row 2 or 3, doesn't matter which of those as they are both the same.

Edit 2:

I thought this was going to work:

DELETE
FROM SUPPLIERS
WHERE ID NOT IN 
 (SELECT TOP 1 MAX(ID) FROM SUPPLIERS  
  Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY 
  ORDER BY COUNT(ID) DESC)

Unfortunately this deletes all but one record, the select version of it looked promising:

SELECT *
FROM SUPPLIERS a
WHERE ID NOT IN 
 (SELECT TOP 1 MAX(ID) FROM SUPPLIERS b 
  WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY 
  ORDER BY COUNT(ID) DESC)

Answer:

With thanks to user1751825 (marked as answer as got me closest to final result)

DELETE FROM SUPPLIERS WHERE ID IN (SELECT ID
FROM SUPPLIERS a
 WHERE ID NOT IN 
  (SELECT TOP 1 MAX(ID) FROM SUPPLIERS b 
  WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY 
  ORDER BY COUNT(ID) DESC))

The "TOP 1" is what's causing problems here. I think my amended answer should work. — user1751825, Apr 07 '16 at 12:16

freakyhat · Answer 1 · 2016-04-07T10:44:18.613

1

As per my understanding, in the example you provided, you want to keep the record with ID = 5 and delete the rest.

WITH CTE AS(
   SELECT ID, ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY,
       RN = ROW_NUMBER()OVER(PARTITION BY ACCNO ORDER BY ID DESC)
  FROM SUPPLIERS
)
DELETE FROM CTE WHERE RN > 1

That should do the trick!

edited Apr 07 '16 at 10:44

answered Apr 07 '16 at 10:11

freakyhat

471
1
4
11

No I would want to keep ID 2 or 3, I've updated question – gavsmith Apr 07 '16 at 10:18
And what if all of the duplicates have the same count for a particular `ACCNO`? – freakyhat Apr 07 '16 at 10:42
Then it would just have to take the first one it finds – gavsmith Apr 07 '16 at 10:51

user1751825 · Accepted Answer · 2016-04-07T12:10:20.637

0

I think this should do what you need.

delete from SUPPLIERS
where ID NOT IN (
    Select max(ID)
    FROM SUPPLIERS
    Group by ACCNO
)

edited Apr 07 '16 at 12:10

answered Apr 07 '16 at 10:10

user1751825

4,029
1
28
58

How can I use this as a subquery of a delete statement to filter out the rows I want to keep vs the rows I want to delete? – gavsmith Apr 07 '16 at 11:14
You just want to keep the latest row for each ACCNO, is that correct? If so, this should be the simplest solution. – user1751825 Apr 07 '16 at 12:12
No, I'm trying keep a row that is included in the highest count of exact duplicates (Per ACCNO). – gavsmith Apr 07 '16 at 12:45
I've got there in the end, put the answer in the question. Marked this as the answer as it got me the closest to what I needed thanks – gavsmith Apr 07 '16 at 13:01

score 0 · Answer 3 · answered Apr 07 '16 at 11:56

this worked for me.

Table

ID   ACCNO       ACCNAME         ADDRESS1        ADDRESS2        City
1    1001        Joe B Ltd       123 Street1                     London
2    1001        JoeB Ltd        123 Street1                     London
3    1001        JoeB Ltd        123 Street1                     London
4    1001        JoeB Ltd        123 Street1     London          London
5    1001        JoeB Ltd        129 Street9                     London
6    67          Nise            Gata1      
7    67          Nisse           Gata2      
8    67          Nisse           Gata1           Haninge         Stockholm

RESULT:

ACCNO    ACCNAME         ADDRESS1        ADDRESS2        City
1001     JoeB Ltd        123 Street1     London          London
67       Nisse           Gata1           Haninge         Stockholm

Code:

select distinct 
    [ ACCNO  ],
    FIRST_VALUE([ ACCNAME    ]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ ACCNAME    ] is null then 1 else 0 end, rownumber  ) as [ ACCNAME    ],
    FIRST_VALUE([ ADDRESS1     ]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ ADDRESS1     ] is null then 1 else 0 end, rownumber  ) as [ ADDRESS1     ],
    FIRST_VALUE([ ADDRESS2    ]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ ADDRESS2    ] is null then 1 else 0 end, rownumber  ) as [ ADDRESS2    ],
    FIRST_VALUE([ City]) OVER (PARTITION BY [ ACCNO  ] ORDER BY case when [ City] is null then 1 else 0 end, rownumber  ) as [ City]
FROM 
(
    SELECT 
          [ ACCNO  ]
          ,[ ACCNAME    ]
          ,[ ADDRESS1     ]
          ,case when ltrim(rtrim([ ADDRESS2    ] )) = '' then null else [ ADDRESS2    ] end as [ ADDRESS2    ] -- spaces = NULL
          ,case when ltrim(rtrim([ City] )) = '' then null else [ City] end as [ City]

          ,count(*) as quantity
          ,ROW_NUMBER() OVER (
                                PARTITION BY [ ACCNO  ] 
                                ORDER BY 
                                    [ ACCNO  ], 
                                    count(*) desc
                             ) as rownumber
      FROM [dbo].[test_sql]  
      GROUP BY cube ( [ ACCNO  ]
          ,[ ACCNAME    ]
          ,[ ADDRESS1     ]
          ,[ ADDRESS2    ]
          ,[ City])
        HAVING [ ACCNO  ] is not null 
 ) myGroup

I don't think it needs to be this complex. – user1751825 Apr 07 '16 at 12:13 — user1751825, Apr 07 '16 at 12:13

SQL remove duplicates - keeping a record included in the highest count

3 Answers3