I have a database which has a few thousand duplicates due to a faulty update tool. I am able to identify the groups of items with duplicates, but I need to delete the older entries in each group, which are not necessarily the ones with the lowest id. The test data looks like this; rows that should be kept are marked with an *.
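Rows that share both a title and a ruleid are duplicates, and all of them except the most recently created row should be deleted. Rows that share a title but have different ruleids (rows 3 and 4) are not duplicates and should be kept. (The actual id column is a GUID, so I cannot assume auto-increment.)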
Id   Article id   Rule Id   Title   Opened Date
--   ----------   -------   -----   -----------
1*   111          5         T1      2013-01-20
2    112          5         T1      2012-07-01
3*   113          6         T2      2012-07-01
4*   114          7         T2      2012-07-02
5    115          8         T3      2012-07-01
6    116          8         T3      2013-01-20
7*   117          8         T3      2013-01-21
Table Schema:
CREATE TABLE [dbo].[test_ai](
    [id] [int] NOT NULL,
    [ArticleId] [varchar](50) NOT NULL,
    [ruleid] [varchar](50) NULL,
    [Title] [nvarchar](max) NULL,
    [AuditData_WhenCreated] [datetime] NULL,
    PRIMARY KEY CLUSTERED
    (
        [id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
Test data inserts:
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (1, 111, 5, 'test 1', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (2, 112, 5, 'test 1', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (3, 113, 6, 'test 2', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (4, 114, 7, 'test 2', '2012-07-02')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (5, 115, 8, 'test 3', '2012-07-01')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (6, 116, 8, 'test 3', '2013-01-20')
insert into test_ai (id, articleid, ruleid, title, auditdata_whencreated) values (7, 117, 8, 'test 3', '2013-01-21')
My current query looks like this:
select * from test_ai
where test_ai.id in
    -- set 1 - all rows with duplicates
    (select F.id
     from test_ai as F
     where exists (select ruleid, title, count(id)
                   from test_ai
                   where test_ai.title = F.title
                     and test_ai.ruleid = F.ruleid
                   group by test_ai.title, test_ai.ruleid
                   having count(test_ai.id) > 1))
and test_ai.id not in
    -- set 2 - includes one row from each set of duplicates
    (select min(id)
     from test_ai as F
     where exists (select ruleid, title, count(id)
                   from test_ai
                   where test_ai.title = F.title
                     and test_ai.ruleid = F.ruleid
                   group by test_ai.title, test_ai.ruleid
                   having count(test_ai.id) > 1)
     group by title, ruleid
    )
This SQL identifies some of the rows that should be deleted (it returns rows 2, 6, and 7), but it does not choose the oldest articles by opened date (it should return rows 2, 5, and 6). I realize I am not telling the statement to do this, but I am struggling with how to add this last piece. If the result is a script that I need to run more than once when a group has more than one extra duplicate, that is not a problem.
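One direction that might supply the missing "newest by date" piece is ranking the rows inside each (Title, ruleid) group with ROW_NUMBER(). This is only a sketch against the test table above. It assumes SQL Server 2005 or later, that a duplicate means the same Title and the same ruleid, and that using id as a tie-breaker for identical dates is acceptable (with a GUID id that tie-break is arbitrary). The names ranked and rn are made up for illustration.

```sql
-- Sketch: number the rows in each (Title, ruleid) group, newest first.
-- rn = 1 is the row to keep; rn > 1 marks the older duplicates.
-- Note: PARTITION BY puts rows with a NULL ruleid into one group together.
WITH ranked AS (
    SELECT id,
           ArticleId,
           ruleid,
           Title,
           AuditData_WhenCreated,
           ROW_NUMBER() OVER (
               PARTITION BY Title, ruleid
               ORDER BY AuditData_WhenCreated DESC, id DESC  -- id is an assumed tie-breaker
           ) AS rn
    FROM dbo.test_ai
)
-- Preview the rows that would be removed before deleting anything.
SELECT id, ArticleId, ruleid, Title, AuditData_WhenCreated
FROM ranked
WHERE rn > 1
ORDER BY id;

-- Once the preview looks right, the same CTE can drive the delete:
-- WITH ranked AS ( ...same definition as above... )
-- DELETE FROM ranked WHERE rn > 1;
```

Against the test inserts above, the preview should list ids 2, 5, and 6. Because every row after the first in a group gets rn > 1, a single pass would cover groups with more than one extra duplicate.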
The actual problem is significantly more complicated, but if I can get past this one blocking part, I'll be able to move forward again. Thanks for taking a look!