T-SQL delete duplicates based on a column

Question

The screenshot shows the first 8 rows of a table. For the same id (each id has thousands of rows), based on the same "updatetime", I only want to keep the first row, delete the rest row(s). Here for example, I want to delete 3rd, 5th, 8th rows. All columns of two rows could be exactly the same (here when updatetime are the same, the UpdateMillisec are different, but not necessary). The screenshot is the result of a query, and I don't have the primary key now (the left-most column in the screenshot is not available in the table now). What SQL code should I write? Thanks in advance!

Screeshot

check here https://stackoverflow.com/questions/18390574/how-to-delete-duplicate-rows-in-sql-server — The beginner, Sep 13 '17 at 05:35
If you don't have a primary key and "UpdateMillisec are different, but not necessary", then you *wil*l have a hard time identifying the records you want to delete. You could still use something like `top()` to limit the delete operation to a limited number of records each time you apply the `delete` clause. A proper primary key would help a lot. — Carsten Massmann, Sep 13 '17 at 06:07

Esteban P. · Accepted Answer · 2017-09-13T06:15:01.740

There is an easy way to delete duplicate rows.

In a first step, we will sort the records and add a rownumber.
Second step will be deleting rows with rownumber > 1.

WITH CTE AS
(
SELECT  *
       ,ROW_NUMBER() OVER 
                (PARTITION BY id, updatetime
                     ORDER BY id, updatetime, UpdateMillisec ASC
                     ) AS RowNum
  FROM yourtable

)
SELECT * FROM CTE                    -- for checking the result before deleting
-- DELETE FROM CTE WHERE RowNum > 1  -- uncomment this row for the final DELETE

Attention:
To identify, which is the first record and which is a following (second, third,..) record, we have to sort the data.
Before deleting them, always check the resultset with a SELECT * FROM CTE first

In your case i checked the resultset of the above query, which is:

id  lastprice   updatetime          UpdateMillisec  RowNum
211709  51370   09:30:00.0000000    500             1
211709  51380   09:30:01.0000000    0               1
211709  51370   09:30:01.0000000    500             2
211709  51370   09:30:02.0000000    0               1
211709  51370   09:30:02.0000000    500             2
211709  51370   09:30:03.0000000    0               1
211709  51370   09:30:04.0000000    0               1
211709  51370   09:30:04.0000000    500             2

As we can see, exactly those records, which you want to delete, have RowNum = 2. So finally we can change the SELECT * to a DELETE and execute the query again.

Thanks for the solution. Did not know it was possible to directly delete records from a CTE! — Carsten Massmann, Sep 13 '17 at 06:26
You're welcome. I found this handy solution some time ago on stackoverflow and use it every time i've to delete duplicates :) — Esteban P., Sep 13 '17 at 06:27

score 0 · Answer 2 · answered Sep 13 '17 at 06:07

Give a row number partitioned by the columns and order by time columns and then delete the unwanted rows.

Query

;with cte as(
    select [rn] = row_number() over(
        partition by [id], [lastprice], [updatetime] 
        order by [id], [updatetime], [updateMillisec]
    ), *
    from [your_table_nam]
)
select * from cte -- first select and check whether these are the rows that has to be deleted
where [rn] > 1;

If ok, then delete the rows having [rn] greater than 1.

delete from cte
where [rn] > 1;

score 0 · Answer 3 · answered Sep 13 '17 at 06:28

I like @Estban P.'s solution. And I was tempted to try further. It turns out to be possible to do it this way too:

DELETE seq FROM (SELECT ROW_NUMBER() 
       OVER(PARTITION BY id, updatetime ORDER BY id, updatetime, updatems ASC) AS RowNum
FROM tbl ) seq where rownum>1;

So, you don't even have to use a CTE, see the demo here http://rextester.com/VLZOD12591

T-SQL delete duplicates based on a column

3 Answers3