
I have a database table with 33,265,902 records in it.

Much of this is made up of duplicates.

It is causing me issues, as any website that queries this table puts significant load on the server.

What should make a record unique is the combination of two fields: email_address and job_id.

My question is: what is the best way to delete all the duplicates, leaving one good record per (email_address, job_id) pair?

And what is the most efficient way to do this without killing the server?

Matthew Stott
  • Possible duplicate of [How to delete duplicate rows in SQL Server?](https://stackoverflow.com/questions/18390574/how-to-delete-duplicate-rows-in-sql-server) – Thom A Jul 23 '19 at 10:17

6 Answers


I would prefer creating a new table from the existing one using DISTINCT values, then dropping the old table and renaming the newly created table. The full sequence of steps (sketched below) is:

1. select distinct columns and insert into new_table
2. drop old_table
3. rename new_table to old_table
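
A minimal T-SQL sketch of those steps (old_table and new_table are placeholder names; note that DISTINCT only collapses rows that are identical in every selected column):

-- 1. copy the de-duplicated rows into a new table
SELECT DISTINCT email_address, job_id   -- plus whichever other columns you need to keep
INTO new_table
FROM old_table;

-- 2. drop the original table
DROP TABLE old_table;

-- 3. rename the new table to the original name
EXEC sp_rename 'new_table', 'old_table';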
Zaynul Abadin Tuhin

You can use LIMIT:

DELETE FROM table [WHERE conditions] LIMIT row_count

Call this query every 5 minutes, limiting each run to about 10,000 rows (or a bit more). Note that LIMIT on DELETE is MySQL syntax; on SQL Server the equivalent is DELETE TOP (n), as sketched below.
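
A sketch of that batched approach in T-SQL, reusing the ROW_NUMBER de-duplication pattern from the other answers (table_name is the placeholder used elsewhere on this page; the 10,000-row batch size is the suggestion above):

WHILE 1 = 1
BEGIN
    ;WITH cte AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY email_address, job_id
                                  ORDER BY (SELECT NULL)) AS rn
        FROM table_name
    )
    DELETE TOP (10000) FROM cte   -- small batches keep locks and log growth manageable
    WHERE rn > 1;

    IF @@ROWCOUNT = 0 BREAK;      -- stop once no duplicates remain
END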


The most efficient way is probably to keep one row per (email_address, job_id) pair in a new table, then truncate the old table and re-insert the values:

select t.*  -- or select all columns except seqnum
into temp_table
from (select t.*,
             row_number() over (partition by email_address, job_id order by job_id) as seqnum
      from t
     ) t
where seqnum = 1;

-- optional if you remove seqnum above
-- alter table temp_table drop column seqnum;


-- back this up first!
truncate table t;

insert into t
    select * from temp_table;

You could instead drop the original table and rename temp_table to the table name. If you do so, remember to re-create constraints, indexes, triggers, partitions, and whatever else is on the original table.
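
Whichever route you take, a unique index on the two key columns (my suggestion, not part of the original answer) would stop duplicates from accumulating again:

-- prevent future duplicates on the de-duplication key
CREATE UNIQUE INDEX ux_t_email_job
    ON t (email_address, job_id);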

This will not be super fast (you can gauge it by timing the creation of the temporary table). However, it avoids the locking and logging involved in deleting rows from the original table. As a bonus, the new table should be less fragmented than if you did the deletes in place.

Gordon Linoff

This may help...

;WITH DuplicateData AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY email_address, job_id
                              ORDER BY (SELECT NULL)) AS [Rank]   -- DENSE_RANK ordered by the partition columns would rank every row 1 and delete nothing
    FROM TableName
)
DELETE d
FROM DuplicateData d
WHERE d.[Rank] > 1
GO

Check this: https://stackoverflow.com/a/56982523/5496498

Khairul Alam

You can try this (replace table_name with your table name):

;WITH cte AS (
    SELECT job_id, email_address,
           ROW_NUMBER() OVER (PARTITION BY job_id, email_address
                              ORDER BY job_id) AS [rn]
    FROM table_name
)
DELETE cte WHERE [rn] > 1

Another method, using a temp table:

SELECT DISTINCT * INTO #tmptbl FROM table_name   -- DISTINCT * keeps one copy of rows identical in every column
DELETE FROM table_name
INSERT INTO table_name
SELECT * FROM #tmptbl
DROP TABLE #tmptbl

I went with Gordon Linoff's suggestion in the end and it seems to have worked great.

My issue now is that I am left with a lot of empty space I need to reclaim for the OS.

Is shrinking the database the right thing to do?
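
For reference, the usual mechanism is DBCC SHRINKFILE; the logical file name and target size below are assumptions, and since shrinking can fragment indexes it is normally a one-off step after a large delete like this:

-- list the database's files and their logical names
EXEC sp_helpfile;

-- shrink a data file down to a target size in MB
DBCC SHRINKFILE (N'MyDatabase_Data', 10240);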

Matthew Stott