I have a table with duplicate data similar to below example:
ID | ACCNO | ACCNAME | ADDRESS1 | ADDRESS2 | City
1 | 1001 | Joe B Ltd | 123 Street1 | | London
2 | 1001 | JoeB Ltd | 123 Street1 | | London
3 | 1001 | JoeB Ltd | 123 Street1 | | London
4 | 1001 | JoeB Ltd | 123 Street1 | London | London
5 | 1001 | JoeB Ltd | 129 Street9 | | London
ID is currently the unique primary key, however ACCNO should be when duplicates removed.
I've seen many queries to remove duplicate records such as https://stackoverflow.com/a/18719814/4949859
However I would like to choose which row to keep based on the count of duplicate rows. I believe that if I select a row from the grouped items with the highest count I'm most likely to get a correctly formatted address.
In my example using "NOT IN (SELECT MAX" or "MIN" will leave the wrong record in my case.
However when I use GROUP BY to get the highest count I can't include the ID field.
SELECT COUNT(ID), ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY FROM SUPPLIERS GROUP BY ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY ORDER BY COUNT(ID) DESC
This would give the result:
Count(ID) | ACCNO | ACCNAME | ADDRESS1 | ADDRESS2 | City
2 | 1001 | JoeB Ltd | 123 Street1 | | London
1 | 1001 | Joe B Ltd | 123 Street1 | | London
1 | 1001 | JoeB Ltd | 123 Street1 | London | London
1 | 1001 | JoeB Ltd | 129 Street9 | | London
Hope I'm making sense. I don't know how to return an ID (any) from a group where the count is highest. Does anybody else know how I might achieve this?
Edit:
I the above example grouping all columns except ID and getting a count, rows 2 and 3 would be grouped together giving a group count of 2 (the rest would be count ID of 1 as they are all unique) so I would want to keep row 2 or 3, doesn't matter which of those as they are both the same.
Edit 2:
I thought this was going to work:
DELETE
FROM SUPPLIERS
WHERE ID NOT IN
(SELECT TOP 1 MAX(ID) FROM SUPPLIERS
Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY
ORDER BY COUNT(ID) DESC)
Unfortunately this deletes all but one record, the select version of it looked promising:
SELECT *
FROM SUPPLIERS a
WHERE ID NOT IN
(SELECT TOP 1 MAX(ID) FROM SUPPLIERS b
WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY
ORDER BY COUNT(ID) DESC)
Answer:
With thanks to user1751825 (marked as answer as got me closest to final result)
DELETE FROM SUPPLIERS WHERE ID IN (SELECT ID
FROM SUPPLIERS a
WHERE ID NOT IN
(SELECT TOP 1 MAX(ID) FROM SUPPLIERS b
WHERE a.ACCNO = b.ACCNO Group By ACCNO, ACCNAME, ADDRESS1, ADDRESS2, CITY
ORDER BY COUNT(ID) DESC))