Can I delete database duplicates based on multiple columns?

Question

I asked this question a while back to delete duplicate records based on a column. The answer worked great:

delete from tbl
where id NOT in
(
select  min(id)
from tbl
group by sourceid
)

I now have a simillar situation but the definition of duplicate record is based on multiple columns. How can I alter this above SQL to identify duplicate records where a unique record is define as concatenated from Col1 + Col2 + Col3. Would i just do something like this ?

delete from tbl
where id NOT in
(
select  min(id)
from tbl
group by col1, col2, col3
)

The principle still stands - it does not matter if grouping is by one or several columns. You are going to retain only the first row in a group. But - if you don't mind me saying - be sure to check your data before you issue delete. Convert delete to select and see what will / will not survive. — Nikola Markovinović, Jul 23 '12 at 14:41
@Nikola Markovinovic - it looks like its doing the right thing but just wanted to confirm — leora, Jul 23 '12 at 15:00
@leora I find the syntax you're using cumbersome to conceptualize / reverse engineer. And it doesn't translate well to cases where id may be null (since `not in` does not behave as you might expect when nulls are involved). I realized that's not likely to be a factor here, but it is important to learn about CTEs and NOT EXISTS for cases where it might... — Aaron Bertrand, Jul 23 '12 at 15:20
You didn't say what version of SQL Server. 2005? 2008? 2012? — ErikE, Dec 01 '12 at 02:08

Aaron Bertrand · Accepted Answer · 2012-11-29T14:09:39.480

This shows the rows you want to keep:

;WITH x AS 
(
  SELECT col1, col2, col3, rn = ROW_NUMBER() OVER 
      (PARTITION BY col1, col2, col3 ORDER BY id)
  FROM dbo.tbl
)
SELECT col1, col2, col3 FROM x WHERE rn = 1;

This shows the rows you want to delete:

;WITH x AS 
(
  SELECT col1, col2, col3, rn = ROW_NUMBER() OVER 
      (PARTITION BY col1, col2, col3 ORDER BY id)
  FROM dbo.tbl
)
SELECT col1, col2, col3 FROM x WHERE rn > 1;

And once you're happy that the above two sets are correct, the following will actually delete them:

;WITH x AS 
(
  SELECT col1, col2, col3, rn = ROW_NUMBER() OVER 
      (PARTITION BY col1, col2, col3 ORDER BY id)
  FROM dbo.tbl
)
DELETE x WHERE rn > 1;

Note that in all three queries, the first 6 lines are identical, and only the subsequent query after the CTE has changed.

score 4 · Answer 2 · edited Nov 29 '12 at 15:05

Try this one. I created a table tblA with three columns.

CREATE TABLE tblA
(
id int IDENTITY(1, 1),
colA int, 
colB int, 
colC int
)

And added some duplicate values.

INSERT INTO tblA VALUES (1, 2, 3)
INSERT INTO tblA VALUES (1, 2, 3)
INSERT INTO tblA VALUES (4, 5, 6)
INSERT INTO tblA VALUES (7, 8, 9)
INSERT INTO tblA VALUES (7, 8, 9)

If you replace the select with a delete in the statement below you will have your multiple column delete working.

SELECT MIN(Id) as id
FROM
(
SELECT COUNT(*) as aantal, a.colA, a.colB, a.colC
FROM tblA       a
INNER JOIN tblA b   ON b.ColA = a.ColA
                    AND b.ColB = a.ColB
                    AND b.ColC = a.ColC
GROUP BY a.id, a.colA, a.colB, a.colC
HAVING COUNT(*) > 1
) c
INNER JOIN tblA d ON d.ColA = c.ColA
                    AND d.ColB = c.ColB
                    AND d.ColC = c.ColC
GROUP BY d.colA, d.colB, d.colC

Can I delete database duplicates based on multiple columns?

2 Answers2

Linked