
I'm attempting to remove redundant rows from an SQL table, [InfoBucket], with columns:

[ID] (varchar(16)), [column1], ... [columnN], [Speed] (bigint)

([column1...N] have datatypes ranging from integers to varchar() columns of varying lengths.)
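
In rough outline, the table looks like this (everything apart from [ID] and [Speed] is a placeholder; the real [column1...N] names and types differ):

create table [InfoBucket]
(
    [ID]      varchar(16) not null,
    [column1] int         null,   -- placeholder: real columns range from ints...
    -- ...
    [columnN] varchar(50) null,   -- ...to varchars of varying lengths
    [Speed]   bigint      null
);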

There are rows in the table that have the same values in [ID] and some of the [column1...N] columns. Of each such set of duplicates, I'm deleting all but the row with the greatest [Speed].

There are approximately 400 million rows in [InfoBucket].

To split the work into manageable chunks, I have another table, [UniqueIDs], with one column:

[ID] (varchar(16))

and which is populated like so:

begin
    insert into [UniqueIDs]
    select distinct [ID] from [InfoBucket]
end
go

There are approximately 15 million rows in [UniqueIDs].

I have been using Martin Smith's excellent answer to a similar question:

My procedure currently looks like this:

begin
    declare @numIDs int
    set @numIDs = 10000

    -- Delete every duplicate except the row with the greatest [Speed],
    -- limited to the next batch of @numIDs IDs.
    ;with toRemove as
    (
        select ROW_NUMBER() over (partition by
                    [ID],
                    [column1],
                    ...
                    [columnN]
                order by [Speed] desc) as RowNum
        from [InfoBucket]
        where [ID] in
        (
            select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
        )
    )
    delete toRemove
    where RowNum > 1
    option (maxdop 1);

    -- Remove the processed batch of IDs from [UniqueIDs].
    ;with IDsToRemove as
    (
        select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
    )
    delete IDsToRemove
    option (maxdop 1)
end
go

go

There are nonclustered indexes on [ID] in both [InfoBucket] and [UniqueIDs], and the "partition by ..." in the over clause only includes the columns that need to be compared.
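
For reference, those indexes are defined roughly like this (the index names here are placeholders, not necessarily the actual ones):

create nonclustered index IX_InfoBucket_ID on [InfoBucket] ([ID]);
create nonclustered index IX_UniqueIDs_ID on [UniqueIDs] ([ID]);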

Now, my problem is that it takes a little over six minutes for this procedure to run. Adjusting the value of @numIDs changes the running time in a linear fashion (i.e. when @numIDs has a value of 1,000 the procedure runs for approximately 36 seconds (6 min. / 10), and when @numIDs has a value of 1,000,000 it runs for approximately 10 hours (6 min. * 100)). This means that removing all duplicates in [InfoBucket] takes days: roughly 15 million IDs / 10,000 IDs per batch = 1,500 batches, at about 6 minutes each, is over 6 days.

I've tried adding a uniqueidentifier column, [UI_ID], to [InfoBucket] and creating a clustered index on it (so [InfoBucket] had one clustered index on [UI_ID] and one nonclustered index on [ID]), but that actually increased the running time.
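
That attempt looked roughly like this (a sketch; the exact default and index name may have differed):

-- Add a row identifier and cluster the table on it.
alter table [InfoBucket]
    add [UI_ID] uniqueidentifier not null default newid();

create clustered index CIX_InfoBucket_UI_ID on [InfoBucket] ([UI_ID]);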

Is there any way I can further optimize this?

1 Answer


The key is to find the sweet spot for deleting the rows. Play with @numIDs to find the fastest increment, and then just let it churn.
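
For example, if the batch above were wrapped in a stored procedure (dbo.RemoveDupeBatch is a made-up name here), letting it churn could be as simple as:

while exists (select 1 from [UniqueIDs])
begin
    -- Each call deletes the dupes for the next @numIDs IDs
    -- and removes those IDs from [UniqueIDs].
    exec dbo.RemoveDupeBatch;
end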

It's 400 million rows; the whole process is not going to complete in minutes, it may well take hours, and it's going to take time. As long as the table does not fill faster than you can remove dupes, you are OK.

Find the sweet spot, then schedule it to run often and off peak. Then check the process from time to time to make sure the sweet spot stays sweet.

The only other thing I can think of is to identify the dupes separately from deleting them. This will save some time, especially if you can identify the dupes in one SQL statement, put that data into yet another table (e.g. [DupeIdsToDelete]), and then run a delete loop against those IDs.
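
A rough sketch of that approach, assuming each row has some unique key to join on (I'm using the [UI_ID] column mentioned in the question; any unique row identifier works, and [DupeIdsToDelete] is just a scratch table name):

-- Identify the dupes once, using the same partition columns as in the question.
select [UI_ID]
into [DupeIdsToDelete]
from
(
    select [UI_ID],
           row_number() over (partition by [ID], [column1], /* ... */ [columnN]
                              order by [Speed] desc) as RowNum
    from [InfoBucket]
) numbered
where RowNum > 1;

-- Then delete against the staged keys in batches.
while 1 = 1
begin
    delete top (10000) ib
    from [InfoBucket] as ib
    join [DupeIdsToDelete] as d
        on d.[UI_ID] = ib.[UI_ID];

    if @@rowcount = 0 break;
end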

ttomsen