316

I have a table with a very large number of rows. Duplicates are not allowed, but due to a problem with how the rows were created, I know there are some duplicates in the table. I need to remove the rows that are duplicates with respect to the key columns; some other columns may have slightly different data, but I don't care about that. I still need to keep one row from each set of duplicates, however. SELECT DISTINCT won't work because it operates on all columns, and I need to suppress duplicates based on the key columns alone.

How can I efficiently delete the extra rows while still keeping one of each?

Samuel Liew
nuit9

3 Answers

604

You didn't say what version you're using, but in SQL Server 2005 and above, you can use a common table expression with the OVER clause. It goes a little something like this:

WITH cte AS (
  SELECT [foo], [bar],
     row_number() OVER(PARTITION BY foo, bar ORDER BY baz) AS [rn]
  FROM TABLE
)
DELETE cte WHERE [rn] > 1

Play around with it and see what you get.

(Edit: In an attempt to be helpful, someone edited the ORDER BY clause within the CTE. To be clear, you can order by anything you want here, it needn't be one of the columns returned by the cte. In fact, a common use-case here is that "foo, bar" are the group identifier and "baz" is some sort of time stamp. In order to keep the latest, you'd do ORDER BY baz desc)
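For example, here's a minimal sketch of that keep-the-latest variant, reusing the placeholder names from the snippet above (foo and bar are the key columns; baz is assumed to be a timestamp-like column):

WITH cte AS (
  SELECT [foo], [bar],
     row_number() OVER (PARTITION BY foo, bar ORDER BY baz DESC) AS [rn]
  FROM TABLE
)
DELETE cte WHERE [rn] > 1  -- rn = 1 is now the newest row in each group, so it survives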

Baldy
Ben Thul
  • 6
    This will retain the last duplicate row or the first row? – SUN Jiangong Dec 22 '11 at 15:48
  • 19
    Just got pulled back to this answer and noticed the question re: which dupe will it retain. As written, it will retain the "first" duplicate row, where "first" means "lowest ordering according to baz". Of course, if ever you're unsure of what will be deleted/retained, turn the delete into a select and make sure. Better safe than sorry. – Ben Thul Sep 19 '12 at 01:03
  • 3
    Don't forget the semicolon before the WITH if it's being executed in a batch, i.e. a transaction: https://msdn.microsoft.com/en-us/library/ms175972.aspx – Mike1234 Jul 20 '15 at 23:54
  • @Mike1234: That's sort of true, but sloppy. The rule is that the previous statement needs to end with a semi-colon. Convince yourself of the truth of this by putting a CTE as the first thing in a batch. – Ben Thul Jul 21 '15 at 00:07
  • do you actually need to select anything but the row number in the CTE? – SumGuy Sep 17 '15 at 18:42
  • 2
    @SumGuy: No; the row number is sufficient. But I like to run this as a select first to inspect what's going to be affected. Also, I just ran a quick test and it appears that SQL Server is smart enough to not carry unneeded columns forward. I determined this by looking at the output column list in an actual execution plan for both the case where I selected everything plus row number and only row number; the two were identical. – Ben Thul Sep 17 '15 at 18:55
  • 5
    With a high row count, the DELETE is probably not recommended (FULL recovery will also cause transaction log to fill). It's probably better to do a SELECT * INTO NewTable FROM cte and then just delete the old table. This will be much faster with very large tables. – Aaron Apr 06 '16 at 15:48
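To make the advice in these comments concrete, here is the delete from the answer rewritten as a select (same placeholder columns), so you can inspect exactly which rows would be removed before running the real thing:

WITH cte AS (
  SELECT [foo], [bar],
     row_number() OVER (PARTITION BY foo, bar ORDER BY baz) AS [rn]
  FROM TABLE
)
SELECT * FROM cte WHERE [rn] > 1  -- these are the rows the DELETE would remove

And a sketch of the copy-out approach suggested for very large tables (NewTable is an illustrative name; you'd drop or rename the old table afterwards):

WITH cte AS (
  SELECT [foo], [bar],
     row_number() OVER (PARTITION BY foo, bar ORDER BY baz) AS [rn]
  FROM TABLE
)
SELECT [foo], [bar]
INTO NewTable      -- keeps exactly one row per duplicate group
FROM cte
WHERE [rn] = 1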
142

Example query:

DELETE FROM Table
WHERE ID NOT IN
(
SELECT MIN(ID)
FROM Table
GROUP BY Field1, Field2, Field3, ...
)

Here the fields are the columns on which you want to group the duplicate rows.

Amit
  • 2
    Using this format I got the following error, any ideas? "ERROR 1093 (HY000): You can't specify target table 'Table' for update in FROM clause" – M1ke Aug 23 '13 at 11:24
  • 6
    @M1ke MySQL doesn't allow updates to the main table that are referenced from sub queries, but there's a workaround; change 'FROM Table' to 'FROM (SELECT * FROM Table) AS t1' this stores the table in a temporary table so it allows updates to the main table. – BigBadMe Jul 03 '14 at 07:43
  • 1
    Thanks, I actually found that same answer somewhere else but can't remember where - so have a plus 1! – M1ke Jul 03 '14 at 10:51
  • 4
    Nice. But what if we don't have a primary key? – ManirajSS Apr 22 '15 at 15:22
  • 1
    What if the primary key is a GUID, not an integer? – merdan Nov 21 '15 at 07:06
  • 1
    Beware of using DELETE in such a way with a "large amount of rows" as the OP suggested. It will take a lot of time due to disk IO if this is not an in memory or memory optimized table. Also, the recovery model matters, if set to FULL, you will potentially be writing a lot of data to transaction logs which adds even more time (and can fill up your drive if not careful). Group by / min or DISTINCT * is probably safest, but CTE / ROW_NUMBER is acceptable and probably the fastest as long as RAM allows it without paging to disk (again, another huge time consumer). – Aaron Apr 06 '16 at 15:45
  • 2
    @merdan, it works with anything that's sortable. e.g. the following is valid `select min(id) from ( select newid() as id union select newid() as id ) as a` – iCodeSometime Nov 15 '18 at 22:41
  • @BigBadMe: Thanks, with your suggested change ('FROM Table' to 'FROM (SELECT * FROM Table) AS t1'), the query above now works for me. – Jagdeep Singh Feb 18 '21 at 06:28
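Putting the MySQL workaround from the comments above into a full statement (a sketch; Field1, Field2, Field3 stand for the grouping columns, and the derived table is what lets MySQL delete from the same table it reads):

DELETE FROM Table
WHERE ID NOT IN
(
  SELECT MIN(ID)
  FROM (SELECT * FROM Table) AS t1
  GROUP BY Field1, Field2, Field3
)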
32

Here's my twist on it, with a runnable example. Note this will only work in the situation where Id is unique, and you have duplicate values in other columns.

DECLARE @SampleData AS TABLE (Id int, Duplicate varchar(20))

INSERT INTO @SampleData
SELECT 1, 'ABC' UNION ALL
SELECT 2, 'ABC' UNION ALL
SELECT 3, 'LMN' UNION ALL
SELECT 4, 'XYZ' UNION ALL
SELECT 5, 'XYZ'

DELETE FROM @SampleData WHERE Id IN (
    SELECT Id FROM (
        SELECT 
            Id
            ,ROW_NUMBER() OVER (PARTITION BY [Duplicate] ORDER BY Id) AS [ItemNumber]
            -- Change the partition columns to include the ones that make the row distinct
        FROM 
            @SampleData
    ) a WHERE ItemNumber > 1 -- Keep only the first unique item
)

SELECT * FROM @SampleData

And the results:

Id          Duplicate
----------- ---------
1           ABC
3           LMN
4           XYZ

Not sure why that's what I thought of first... definitely not the simplest way to go but it works.

Cᴏʀʏ
  • 2
    This doesn't preserve one original in duplicates. This deletes the original as well. – Sandy May 18 '15 at 17:12
  • 1
    Hi @Sandy, have you verified this? I answered four years ago, I can't remember if I tested it on real data or not. – Cᴏʀʏ May 18 '15 at 17:14
  • 1
    Yes, I checked on real data. This will delete the original, as well. – Sandy May 18 '15 at 17:22
  • 1
    @Sandy, I just tested my query against some sample data, and it works fine. See my edit to this answer for a runnable example. My thoughts are maybe you didn't have the `ROW_NUMBER()` function applied correctly. – Cᴏʀʏ May 18 '15 at 17:45
  • 1
    Oh. I was expecting the Id to be duplicated as well, so the rows would be (1,ABC), (1,ABC), (3,LMN), (3,LMN). I was looking for an answer for that scenario. – Sandy May 18 '15 at 17:50
  • 1
    Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/78111/discussion-between-sandy-and-c). – Sandy May 18 '15 at 18:24
  • 2
    Can we please delete this post? Because unless you test it and read the comments, it's BLOODY dangerous! – Fandango68 Dec 03 '18 at 05:24
  • 5
    @Fandango68: I believe I've explained the risks in the body of the post. Copy and pasting random Internet code snippets is a dangerous endeavor. You're more than welcome to vote to delete the post to see if the community agrees. – Cᴏʀʏ Dec 03 '18 at 15:36
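For the scenario raised in the comments, where Id itself is duplicated (e.g. (1,ABC), (1,ABC)), the WHERE Id IN approach can't tell the copies apart, but the CTE delete from the accepted answer still can, because deleting through the CTE removes specific row instances rather than matching by value. A sketch against the same sample table:

WITH dupes AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY Id, Duplicate ORDER BY Id) AS rn
    FROM @SampleData
)
DELETE dupes WHERE rn > 1  -- removes every copy after the first in each group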