I'm attempting to remove redundant rows from an SQL table, [InfoBucket], with columns:
[ID] (varchar(16)), [column1], ... [columnN], [Speed] (bigint)
([column1...N] have datatypes ranging from integers to varchar() values of varying lengths.)
There are rows in the table that have the same values in the [ID] column and in some of the [column1...N] columns. For each such group of duplicates I want to keep only the row with the greatest [Speed] and delete the rest.
There are approximately 400 million rows in [InfoBucket].
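For illustration, the duplicate groups can be surfaced with an aggregate along these lines (only a sketch: just [column1] is listed here, and the real grouping list would include every column that participates in the comparison):

select [ID], [column1], count(*) as [DupCount]
from [InfoBucket]
group by [ID], [column1]
having count(*) > 1

Within each such group, only the row with the greatest [Speed] should survive.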
To split the work into manageable chunks, I have another table, [UniqueIDs], with one column:
[ID] (varchar(16))
and which is populated like so:
begin
    insert into [UniqueIDs]
    select distinct [ID] from [InfoBucket]
end
go
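For completeness, [UniqueIDs] itself is nothing more than a single-column work queue, created along these lines (sketch):

create table [UniqueIDs]
(
    [ID] varchar(16) not null
)
go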
There are approximately 15 million rows in [UniqueIDs].
I have been using Martin Smith's excellent answer to a similar question.
My procedure currently looks like this:
begin
    declare @numIDs int
    set @numIDs = 10000

    -- Number the rows in each duplicate group, fastest first, and delete
    -- every row except the first one in its group.
    ;with toRemove as
    (
        select ROW_NUMBER() over (partition by
                                      [ID],
                                      [column1],
                                      ...
                                      [columnN]
                                  order by [Speed] desc) as [RowNum]
        from [InfoBucket]
        where [ID] in
        (
            select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
        )
    )
    delete toRemove
    where [RowNum] > 1
    option (maxdop 1);

    -- Remove the batch of IDs that was just processed from the work queue.
    ;with IDsToRemove as
    (
        select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
    )
    delete IDsToRemove
    option (maxdop 1)
end
go
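The batch above only handles one chunk of @numIDs IDs; it is meant to be run repeatedly until [UniqueIDs] is empty, with a driver loop along these lines (a sketch, assuming the batch has been wrapped in a stored procedure named [RemoveDuplicateBatch] here for illustration):

while exists (select 1 from [UniqueIDs])
begin
    exec [dbo].[RemoveDuplicateBatch]
end
go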
There are nonclustered indexes on [ID] in both [InfoBucket] and [UniqueIDs], and the "partition by ..." in the over clause only includes the columns that need to be compared.
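For reference, those indexes look roughly like this (the index names are illustrative):

create nonclustered index [IX_InfoBucket_ID] on [InfoBucket] ([ID])
go
create nonclustered index [IX_UniqueIDs_ID] on [UniqueIDs] ([ID])
go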
Now, my problem is that it takes a little over six minutes for this procedure to run. Adjusting the value of @numIDs changes the running time in a linear fashion (i.e. when @numIDs has a value of 1,000 the procedure runs for approximately 36 seconds (6 min. / 10), and when @numIDs has a value of 1,000,000 it runs for approximately 10 hours (6 min. * 100)); this means that removing all duplicates from [InfoBucket] would take days.
I've tried adding a uniqueidentifier column, [UI_ID], to [InfoBucket] and creating a clustered index on it (so [InfoBucket] had one clustered index on [UI_ID] and one nonclustered index on [ID]), but that actually increased the running time.
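For reference, that attempt looked roughly like the following (the constraint and index names, and the use of newsequentialid() as the default, are illustrative):

alter table [InfoBucket]
    add [UI_ID] uniqueidentifier not null
        constraint [DF_InfoBucket_UI_ID] default newsequentialid()
go

create clustered index [CIX_InfoBucket_UI_ID] on [InfoBucket] ([UI_ID])
go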
Is there any way I can further optimize this?