Removing Duplicates - only all but the most recently dated row

Question

Possible Duplicate:
How can I find duplicate entries and delete the oldest ones in SQL?

I have a database which has a few thousand duplicates due to a faulty update tool. I am able to identify the collections of items with duplicates, but need to delete only the oldest entries, not necessarily the lowest id. Test data looks like this; the rows to keep are marked with *.

Articles that share both a title and a ruleid are duplicates and should be deleted, except for the most recently created row in each group. (The actual id column is a GUID, so I cannot assume auto-increment.)

Id           Article id          Rule Id         Title          Opened Date
--           ----------          -------         -----          -----------
1*           111                 5               T1             2013-01-20
2            112                 5               T1             2012-07-01
3*           113                 6               T2             2012-07-01
4*           114                 7               T2             2012-07-02
5            115                 8               T3             2012-07-01
6            116                 8               T3             2013-01-20
7*           117                 8               T3             2013-01-21

Table Schema:

CREATE TABLE [dbo].[test_ai](
    [id] [int] NOT NULL,
    [ArticleId] [varchar](50) NOT NULL,
    [ruleid] [varchar](50) NULL,
    [Title] [nvarchar](max) NULL,
    [AuditData_WhenCreated] [datetime] NULL,
PRIMARY KEY CLUSTERED 
(
    [id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)

Test Data Inserts

insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (1, 111, 5, 'test 1', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (2, 112, 5, 'test 1', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (3, 113, 6, 'test 2', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (4, 114, 7, 'test 2', '2012-07-02')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (5, 115, 8, 'test 3', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (6, 116, 8, 'test 3', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (7, 117, 8, 'test 3', '2013-01-21')

My current query looks like this

select * from test_ai
where test_ai.id in

-- set 1 - all rows with duplicates
(select f.id 
from test_ai as F 
WHERE exists (select ruleid, title, count(id)   
FROM test_ai
    WHERE test_ai.title = F.title
        AND test_ai.ruleid = F.ruleid
    GROUP BY test_ai.title, test_ai.ruleid
    having count(test_ai.id) > 1))
    and test_ai.id not in

-- set 2 - includes one row from each set of duplicates
(select min(id)
from test_ai as F
WHERE EXISTS (select ruleid, title, count(id)
from test_ai
WHERE test_ai.title = F.title 
    AND test_ai.ruleid = F.ruleid
group by test_ai.title, test_ai.ruleid
HAVING count(test_ai.id) > 1)   
GROUP BY title, ruleid
)

This SQL identifies some of the rows that should be deleted (rows 2, 6, 7), but it does not choose the oldest article by 'opened date' (it should delete rows 2, 5, 6). I realize I am not specifying this in the statement, but I am struggling with how to add this last piece. If it results in a script that I need to run more than once when a group has more than one duplicate, that is not a problem.

The actual problem is significantly more complicated, but if I can get past this one blocking part, I'll be able to move forward again. Thanks for taking a look!

I think this would be of help to you: http://jzinedine.me/post/30604785957/a-flexible-way-to-delete-duplicate-rows-in-sql — Jahan Zinedine, Jan 21 '13 at 23:23
Based on what rows the description said you wanted to delete, shouldn't the title of this question be "only keep newest row" or "delete all but the newest row"? Right now the title doesn't match your actual requirements. — Aaron Bertrand, Jan 21 '13 at 23:35

Aaron Bertrand · Accepted Answer · 2013-01-21T23:30:43.847

4

The typical model for deleting one row from a set (or from each group in a set) in SQL Server 2005+ is:

;WITH cte AS 
(
  SELECT col, rn = ROW_NUMBER() OVER 
    (PARTITION BY something ORDER BY something)
  FROM dbo.base_table
  WHERE ...
)
DELETE cte WHERE rn = 1;

In your case this would be:

;WITH cte AS 
(
  SELECT id, ruleid, Title, rn = ROW_NUMBER() OVER 
  (
     PARTITION BY ruleid, Title  
     ORDER BY auditdata_whencreated DESC
  )
  FROM dbo.test_ai
)
DELETE cte 
  OUTPUT deleted.id
  WHERE rn > 1;

Results:

id
----
2
6
5
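
Before running the DELETE, it can help to preview the rows it would remove. The following is a minimal sketch against the same dbo.test_ai table, not part of the original answer: it swaps the DELETE for a SELECT and changes nothing in the table. The extra `id DESC` tiebreaker is an assumption, added only so that rows sharing the same timestamp are ranked deterministically.

```sql
-- Preview only: rank every row inside its (ruleid, Title) group.
-- rn = 1 is the newest row in the group (kept); rn > 1 would be deleted.
;WITH cte AS
(
  SELECT id, ruleid, Title, auditdata_whencreated,
    rn = ROW_NUMBER() OVER
    (
      PARTITION BY ruleid, Title
      -- id as a tiebreaker is an assumption, not in the original answer;
      -- it only matters when two rows share the same timestamp
      ORDER BY auditdata_whencreated DESC, id DESC
    )
  FROM dbo.test_ai
)
SELECT id, ruleid, Title, auditdata_whencreated, rn
FROM cte
WHERE rn > 1               -- same filter the DELETE uses
ORDER BY ruleid, Title, rn;
```

With the test data from the question, this should list ids 2, 6 and 5, the same rows the DELETE reports through its OUTPUT clause.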

edited Jan 21 '13 at 23:30

answered Jan 21 '13 at 23:23

Aaron Bertrand


@mellamokb But I think this answer is simpler, like what I've mentioned in question comments. – Jahan Zinedine Jan 21 '13 at 23:30
@Jani: Aren't our answers all exactly the same idea? – mellamokb Jan 21 '13 at 23:31
@mellamokb hey man! You updated the fiddle in the meantime :-D – Jahan Zinedine Jan 21 '13 at 23:32
Found something similar at the same time taking an example from http://stackoverflow.com/questions/679855/how-can-i-find-duplicate-entries-and-delete-the-oldest-ones-in-sql – Chris Ballance Jan 21 '13 at 23:32
@mellamokb if your fiddle is the same as what I posted, then what is the conversation about in the first place? "Me too! Me too!" – Aaron Bertrand Jan 21 '13 at 23:33