How can I delete duplicate rows in a table

Question

I have a table with, say, 3 columns. There's no primary key, so there can be duplicate rows. I need to keep just one of each and delete the others. Any idea how to do this in SQL Server?

Manrico Corazzi · Accepted Answer · 2008-09-18T12:00:28.993

23

I'd SELECT DISTINCT the rows and throw them into a temporary table, then empty the source table and copy the data back from the temp. EDIT: now with code snippet!

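-- TABLE_2 must already exist with the same column layout as TABLE_1
INSERT INTO TABLE_2
SELECT DISTINCT * FROM TABLE_1
GO
-- Empty the source table
DELETE FROM TABLE_1
GO
-- Copy the de-duplicated rows back
INSERT INTO TABLE_1
SELECT * FROM TABLE_2
GO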

edited Sep 18 '08 at 12:00

answered Sep 18 '08 at 11:37

Manrico Corazzi

that's the cleanest and most generic solution, given that you have the disk space (the final frontier) – tzot Sep 18 '08 at 11:48
So there's no way to do it using an SQL query? – Malik Daud Ahmad Khokhar Sep 18 '08 at 11:50
Actually that's three queries: INSERT INTO TABLE_2 SELECT DISTINCT * FROM TABLE_1 GO DELETE FROM TABLE_1 GO INSERT INTO TABLE_1 SELECT * FROM TABLE_2 GO – Manrico Corazzi Sep 18 '08 at 11:59
I meant without creating a new table. – Malik Daud Ahmad Khokhar Sep 18 '08 at 12:14
This can fail if there are tables that depend on this table. – Joel Coehoorn Sep 18 '08 at 13:08
Pretty unlikely: it's unsafe to create a FK to a table w/o primary key (if that's what you meant with "depends")... – Manrico Corazzi Sep 18 '08 at 13:38

score 7 · Answer 2 · answered Sep 18 '08 at 11:36

Add an identity column to act as a surrogate primary key, and use this to identify two of the three rows to be deleted.

I would consider leaving the identity column in place afterwards, or if this is some kind of link table, create a compound primary key on the other columns.
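
A minimal T-SQL sketch of this approach. The table name MyTable, the columns Col1, Col2, Col3 and the new column RowId are placeholders chosen for illustration, not names from the question:

```sql
-- Hypothetical names: MyTable, Col1..Col3, RowId.
-- Step 1: add a surrogate identity column; SQL Server numbers the existing rows.
ALTER TABLE MyTable ADD RowId INT IDENTITY(1, 1) NOT NULL;
GO  -- start a new batch so the next statement can see the new column

-- Step 2: keep the lowest RowId in each group of identical rows, delete the rest.
DELETE FROM MyTable
WHERE RowId NOT IN (
    SELECT MIN(RowId)
    FROM MyTable
    GROUP BY Col1, Col2, Col3
);
GO

-- Optional step 3: remove the column again if it should not stay.
-- ALTER TABLE MyTable DROP COLUMN RowId;
```

GO is a batch separator understood by client tools such as SSMS and sqlcmd, not a T-SQL statement, so the script is meant to be run from one of those tools.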

Ian Nelson

Adding an identity column will definitely help. SQL Server will generate a ghost column to make each record unique, but you will not be able to query this column. The identity column will reduce some of that overhead and guarantee uniqueness. – Sep 18 '08 at 15:48

Martin · Answer 3 · 2013-01-10T21:16:27.650

The following example also works when your PK is just a subset of the table's columns.

(Note: I prefer the approach of adding a surrogate id column. But maybe this solution comes in handy as well.)

First find the duplicate rows:

SELECT col1, col2, count(*)
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1

If there are only a few, you can delete them manually:

set rowcount 1
delete from t1
where col1=1 and col2=1
set rowcount 0

The value of "rowcount" should be n-1, where n is the number of rows that share the key value. In this example there are 2 duplicates, therefore rowcount is 1. Reset rowcount to 0 afterwards so that later statements are not limited. If several key values are duplicated, you have to do this for each of them.
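
On SQL Server 2005 and later, the same single-row delete can also be written with DELETE TOP, which limits one statement instead of changing a session setting. A minimal sketch, assuming the t1 table and the key values from the example above:

```sql
-- Assumes t1 contains exactly two rows with col1 = 1 and col2 = 1.
-- TOP (1) limits this statement to one row, so one copy is left in place.
DELETE TOP (1)
FROM t1
WHERE col1 = 1 AND col2 = 1;
```

For a key that occurs n times, TOP (n - 1) plays the same role as the rowcount value described above.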

If you have many duplicates, then copy every duplicated key once into another table:

SELECT col1, col2, col3=count(*)
INTO holdkey
FROM t1
GROUP BY col1, col2
HAVING count(*) > 1

Then copy the duplicated rows into a second holding table, eliminating the duplicates in the process:

SELECT DISTINCT t1.*
INTO holddups
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

holddups should now contain each key exactly once. Check that the following query returns no rows:

SELECT col1, col2, count(*)
FROM holddups
GROUP BY col1, col2
HAVING count(*) > 1

Delete the duplicates from the original table:

DELETE t1
FROM t1, holdkey
WHERE t1.col1 = holdkey.col1
AND t1.col2 = holdkey.col2

Insert the de-duplicated rows back into the original table:

INSERT t1 SELECT * FROM holddups

btw and for completeness: In Oracle there is a hidden field you could use (rowid):

DELETE FROM our_table
WHERE rowid not in
(SELECT MIN(rowid)
FROM our_table
GROUP BY column1, column2, column3...);

see: Microsoft Knowledge Site

You should have mentioned you got this from Microsoft's support site. http://support.microsoft.com/kb/139444 — Tony_Henrich, May 19 '10 at 07:51
@Tony: That is correct. In my defense: I had this copied in my local programming wiki and was no longer aware of where it came from. – Martin, Jan 10 '13 at 21:19

score 4 · Answer 4 · answered Sep 19 '08 at 06:52

This is a way to do it with Common Table Expressions, CTE. It involves no loops, no new columns or anything and won't cause any unwanted triggers to fire (due to deletes+inserts).

Inspired by this article.

CREATE TABLE #temp (i INT)

INSERT INTO #temp VALUES (1)
INSERT INTO #temp VALUES (1)
INSERT INTO #temp VALUES (2)
INSERT INTO #temp VALUES (3)
INSERT INTO #temp VALUES (3)
INSERT INTO #temp VALUES (4)

SELECT * FROM #temp

;
WITH [#temp+rowid] AS
(SELECT ROW_NUMBER() OVER (ORDER BY i ASC) AS ROWID, * FROM #temp)
DELETE FROM [#temp+rowid] WHERE rowid IN 
(SELECT MIN(rowid) FROM [#temp+rowid] GROUP BY i HAVING COUNT(*) > 1)

SELECT * FROM #temp

DROP TABLE #temp
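
The DELETE above removes one row per duplicated value on each run, so a value that occurs three or more times would still have duplicates afterwards. A variant that numbers the rows within each group handles any number of copies in one pass. This is a self-contained sketch; the table name #dupes and the alias rn are made up for illustration:

```sql
-- Hypothetical temp table with one value that occurs three times.
CREATE TABLE #dupes (i INT);

INSERT INTO #dupes VALUES (1);
INSERT INTO #dupes VALUES (1);
INSERT INTO #dupes VALUES (1);
INSERT INTO #dupes VALUES (2);
INSERT INTO #dupes VALUES (3);
INSERT INTO #dupes VALUES (3);

-- Number the rows inside each group of equal values: 1, 2, 3, ...
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY i ORDER BY i) AS rn
    FROM #dupes
)
-- Deleting through the CTE deletes the underlying #dupes rows.
DELETE FROM numbered
WHERE rn > 1;  -- rn = 1 is the copy that is kept

SELECT i FROM #dupes ORDER BY i;  -- one row each for 1, 2 and 3

DROP TABLE #dupes;
```

For a table with several columns, list all of them in PARTITION BY so that only fully identical rows fall into the same group.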

@Jonas - that, my friend, is very cool. And it just solved a problem i had. Thanks! — b w, Oct 17 '11 at 19:40

score 4 · Answer 5 · edited May 23 '17 at 12:07

Here's the method I used when I asked this question -

DELETE MyTable 
FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3 
   FROM MyTable 
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL

answered Sep 18 '08 at 14:26

Seibar

score 2 · Answer 6 · answered Sep 18 '08 at 11:38

This is a tough situation to be in. Without knowing your particular situation (table size, etc.), I think your best shot is to add an identity column, populate it, and then delete according to it. You may remove the column later, but I would suggest keeping it, as it is really a good thing to have in the table.

score 0 · Answer 7 · edited Nov 25 '15 at 12:08

How about:

select distinct * into #t from duplicates_tbl

truncate duplicates_tbl

insert duplicates_tbl select * from #t

drop table #t

Sabyasachi Mishra

answered Sep 19 '08 at 13:53

Brann · Answer 8 · 2009-03-04T10:47:24.800

What about this solution :

First you execute the following query :

  select 'set rowcount ' + convert(varchar,COUNT(*)-1) + ' delete from MyTable where field=''' + field +'''' + ' set rowcount 0'  from mytable group by field having COUNT(*)>1

And then you just have to execute the returned result set

set rowcount 3 delete from Mytable where field='foo' set rowcount 0
....
....
set rowcount 5 delete from Mytable where field='bar' set rowcount 0

I've handled the case where you've got only one column, but it's pretty easy to adapt the same approach to more than one column. Let me know if you want me to post the code.

score 0 · Answer 9 · answered Sep 18 '08 at 12:45

After you clean up the current mess, you could add a primary key that includes all the fields in the table. That will keep you from getting into the mess again. Of course, this solution could very well break existing code, which would have to be handled as well.
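
A sketch of what that could look like in T-SQL. The names MyTable, Col1, Col2, Col3 and PK_MyTable are placeholders, not taken from the question:

```sql
-- Hypothetical names: MyTable, Col1..Col3, PK_MyTable.
-- Preconditions: the duplicates are already removed and every key column
-- is declared NOT NULL; otherwise this statement fails.
ALTER TABLE MyTable
ADD CONSTRAINT PK_MyTable PRIMARY KEY (Col1, Col2, Col3);
```

Once the constraint exists, an INSERT that would create a duplicate row is rejected with an error, which is the behavior change that existing code may need to handle.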

score 0 · Answer 10 · answered Sep 18 '08 at 14:28

Can you add a primary key identity field to the table?

Seibar

score 0 · Answer 11 · answered Sep 18 '08 at 15:17

Manrico Corazzi - I specialize in Oracle, not MS SQL, so you'll have to tell me if this is possible as a performance boost:-

Leave the same as your first step - insert distinct values into TABLE2 from TABLE1.
Drop TABLE1. (Drop should be faster than delete I assume, much as truncate is faster than delete).
Rename TABLE2 as TABLE1 (saves you time, as you're renaming an object rather than copying data from one table to another).
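
In SQL Server the three steps above could be sketched roughly as follows, assuming the table names TABLE_1 and TABLE_2 from the accepted answer and that TABLE_2 does not exist yet:

```sql
-- Step 1: SELECT ... INTO creates TABLE_2 and fills it with the distinct rows.
SELECT DISTINCT * INTO TABLE_2 FROM TABLE_1;
GO
-- Step 2: drop the original table.
DROP TABLE TABLE_1;
GO
-- Step 3: sp_rename is the system procedure for renaming objects.
EXEC sp_rename 'TABLE_2', 'TABLE_1';
GO
```

SELECT ... INTO copies columns and data only, so any indexes, constraints, triggers and permissions on the original table would have to be recreated afterwards.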

score 0 · Answer 12 · answered Sep 18 '08 at 17:51

Here's another way, with test data

create table #table1 (colWithDupes1 int, colWithDupes2 int)
insert into #table1
(colWithDupes1, colWithDupes2)
Select 1, 2 union all
Select 1, 2 union all
Select 2, 2 union all
Select 3, 4 union all
Select 3, 4 union all
Select 3, 4 union all
Select 4, 2 union all
Select 4, 2 


select * from #table1

set rowcount 1
select 1

while @@rowcount > 0
delete #table1  where 1 < (select count(*) from #table1 a2 
   where #table1.colWithDupes1 = a2.colWithDupes1
and #table1.colWithDupes2 = a2.colWithDupes2
)

set rowcount 0

select * from #table1

score -1 · Answer 13 · answered Sep 18 '08 at 12:51

I'm not sure if this works with DELETE statements, but this is a way to find duplicate rows:

 SELECT *
 FROM myTable t1, myTable t2
 WHERE t1.field = t2.field AND t1.id > t2.id

I'm not sure if you can just change the "SELECT" to a "DELETE" (someone wanna let me know?), but even if you can't, you could just make it into a subquery.
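
For what it's worth, T-SQL does accept a join in a DELETE. A minimal sketch, assuming the myTable, field and id names from the query above and that id is unique:

```sql
-- Deletes every row for which another row with the same field value
-- and a smaller id exists, so the row with the lowest id survives.
DELETE t1
FROM myTable AS t1
INNER JOIN myTable AS t2
    ON t1.field = t2.field
   AND t1.id > t2.id;
```

This only helps when the table already has a unique id column, which the table in the question lacks until one is added.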
