
I know this topic has come up many times before, but none of the suggested solutions worked for my dataset: my laptop stopped the calculation because of memory issues or a full disk.

My table looks like the following and has 108 million rows:

Col1        | Col2 | Col3            | Col4 | SICComb  | NameComb
Case New    | 3523 | Alexander       | 6799 | 67993523 | AlexanderCase New
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
Undisclosed | 6799 | Case New        | 3523 | 67993523 | Case NewUndisclosed
Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
SmartCard   | 3674 | Virtual NetComm | 7373 | 73733674 | SmartCardVirtual NetComm
SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard

The unique columns are SICComb and NameComb. I tried to add a primary key with:

ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)

but the integers filled up more than 30 GB of my storage in just a few minutes.

What would be the fastest and most efficient way to delete the duplicates from the table?

gotqn
user2713440
2 Answers


If you're using SQL Server, you can delete from a common table expression:

with cte as (
    -- number the rows within each duplicate group
    select row_number() over(partition by SICComb, NameComb order by Col1) as row_num
    from Table1
)
-- keep row_num = 1 in each group, delete the rest
delete
from cte
where row_num > 1;

Here all rows get numbered, with a separate sequence for each unique combination of SICComb + NameComb. You can choose which row survives in each group by changing the order by inside the over clause.
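For example, a minimal sketch (table and column names taken from the question; reversing the sort means the row with the highest Col2 in each group is the one that survives):

with cte as (
    select row_number() over(
               partition by SICComb, NameComb   -- one numbering sequence per duplicate group
               order by Col2 desc               -- row_num = 1 is now the row with the highest Col2
           ) as row_num
    from Table1
)
delete
from cte
where row_num > 1;   -- every row after the first in each group is deleted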

Roman Pekar
  • @ShahgholiArdalan don't touch my code. I believe the tradition of using uppercase keywords in SQL should go away; I always format my code for readability, don't make my answer worse! – Roman Pekar Aug 28 '13 at 08:55
  • This seems to work better than the accepted answer for removing duplicates from tables with large numbers of columns and rows, and a column that holds a unique identifier. – Malcolm Oct 07 '16 at 13:43
  • "Not all queries benefit, some can go dramatically backwards. Can depend if your index is aligned or not aligned. In general aligned indexes are marginally slower to access. Sometimes all partitioned index b-trees accessed to find key. SORT / TOP queries can be devastatingly slow. When Search key not partitioned as leading segment" - sqlsaturday.com – Sam Aug 20 '18 at 17:55

In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.

Here is the idea, using SQL Server syntax:

-- copy one row per distinct combination into a temporary table
select distinct t.*
into #temptable
from t;

-- empty the original table
truncate table t;

-- put the de-duplicated rows back
insert into t
    select tt.*
    from #temptable tt;

Of course, this depends to a large extent on how fast the first step is. And, you need to have the space to store two copies of the same table.

Note that the syntax for creating the temporary table differs among databases. Some use the syntax of create table as rather than select into.
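For example, a rough sketch of the same steps using create table ... as (PostgreSQL and MySQL accept this form; exact temporary-table behaviour varies by database):

create temporary table temptable as
    select distinct t.*
    from t;                 -- one copy of each distinct row

truncate table t;           -- empty the original table

insert into t
    select tt.*
    from temptable tt;      -- put the de-duplicated rows back

drop table temptable;       -- optional; temporary tables also disappear at the end of the session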

EDIT:

Your identity insert error is troublesome. I think you need to remove the identity from the list of columns for the distinct. Or do:

select min(<identity col>), <all other columns>
from t
group by <all other columns>

If you have an identity column, then there are no duplicates (by definition).

In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, then just leave the identity column out of the column list for the insert:

insert into t(<all other columns>)
    select <all other columns>
    from #temptable;

If you need the old identity values (and the minimum will do), turn on identity insert (SET IDENTITY_INSERT ... ON) and do:

insert into t(<all columns including identity>)
    select <all columns including identity>
    from #temptable;
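A minimal sketch of that last step in SQL Server, assuming the question's table dbo.test with its ID identity column, and assuming #temptable was built with min(ID) aliased as ID:

set identity_insert dbo.test on;   -- allow explicit values in the identity column

insert into dbo.test (ID, Col1, Col2, Col3, Col4, SICComb, NameComb)
    select ID, Col1, Col2, Col3, Col4, SICComb, NameComb
    from #temptable;

set identity_insert dbo.test off;  -- back to auto-generated values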
Gordon Linoff
  • +1, I like this solution as it is clean and kind of a best practice. But I would add one note: before running this procedure, gather statistics; it could be that there are only a few duplicates, in which case it is faster/better to just delete them instead of moving millions of rows back and forth... – MrSimpleMind Aug 24 '13 at 12:18
  • @MrSimpleMind . . . You are correct. If there are just a handful of duplicates (say thousands in a table of millions of rows), it would be better to use `delete` directly. – Gordon Linoff Aug 24 '13 at 12:52
  • @Gordon Linoff: Thanks for your answer. Unfortunately, it says that IDENTITY_INSERT is not turned on. The result of running the code is an empty table. – user2713440 Aug 24 '13 at 13:32
  • @Gordon Linoff: Your edit helped. But now the error message "Column name or number of supplied values does not match table definition" is popping up. – user2713440 Aug 24 '13 at 16:53
  • @user2713440 . . . You need to be sure that the column list for `insert` and the list after the `select` have the same columns in the same order. – Gordon Linoff Aug 24 '13 at 18:53