
I have an issue. I have a table with almost 2 billion rows (yeah, I know...) that contains a lot of duplicate data I'd like to delete. I was wondering how to do that exactly?

The columns are: first, last, dob, address, city, state, zip, telephone, and they are in a table called PF_main. Each record does have a unique ID, thankfully, and it's in a column called ID.

How can I dedupe this and leave one unique entry (row) in the pf_main table for each person?

Thank you all in advance for your responses...

Sal
  • See this question for a few different ways to do this: http://stackoverflow.com/questions/6025367/t-sql-deleting-all-duplicate-rows-but-keeping-one – Cameron Bell Aug 07 '14 at 01:41
  • Only issue is there are tons of other fields in there that I want to ignore, so I kind of need something custom for this... – Sal Aug 07 '14 at 01:46
  • Do you have to do this as an offline or online operation? E.g. Must the table remain available for read/write access? – Michael Petito Aug 07 '14 at 01:47
  • What exactly do you consider a "duplicate"? Same first and last name, same first / last / dob, or same ? – Aaron Bertrand Aug 07 '14 at 01:48
  • Online access, and by duplicate I mean rows where all of the fields I listed contain the same data. – Sal Aug 07 '14 at 02:46
  • One other aspect to consider is whether your address field is normalized. Two records may be identical except that one address is 123 Nice St while the other is 123 Nice Street. Similar issues may occur with other fields. – Michael Petito Aug 07 '14 at 14:10

5 Answers

SELECT 
   ID, first, last, dob, address, city, state, zip, telephone, 
   ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
FROM PF_main

This will give you the "number" of each entry within its group of duplicates (ordered by ID),

so if you have the following records:

id, last, first, dob, address, city, state, zip, telephone
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0064
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074

you will get the following results (note the last column):

006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0064, 1
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 1
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 2
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 3

So you can just delete records with RecordInstance > 1:

WITH Records AS
(
   SELECT 
      ID, first, last, dob, address, city, state, zip, telephone,
      ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
   FROM PF_main
)
DELETE FROM Records
WHERE RecordInstance > 1
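
If deleting that many duplicates in one statement is too big a transaction for your log, one variation (a sketch only; the batch size of 10000 is an arbitrary number you would tune) is to repeat the same CTE with DELETE TOP in a loop so each batch commits on its own:

WHILE 1 = 1
BEGIN
   ;WITH Records AS
   (
      SELECT 
         ID,
         ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
      FROM PF_main
   )
   -- delete one batch of duplicates per iteration
   DELETE TOP (10000) FROM Records
   WHERE RecordInstance > 1;

   IF @@ROWCOUNT = 0 BREAK;   -- stop once no duplicates remain
END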
Mark Sowul
  • Thanks everyone! Seems like I have a lot of work to do. I'll let you know what happens :-) – Sal Aug 07 '14 at 02:48

Other answers will certainly give you ideas as far as syntax.

With 2 billion rows, your concerns may involve other things besides syntax, so I will give you a generic answer that works on many databases. If you cannot afford to do the delete or copy "online" in one session, or are low on space, then consider the following incremental approach.

Deletes this large can take a long, long time to complete, as in hours or even days, and also risk failing before completion. In a few cases, the approach that worked the best, surprisingly, was a rudimentary, long-running stored procedure that took small batches and committed every few records (few being a relative term here). Few might be 100 or 1000 or 10000 records. Sure it doesn't look elegant, but the point is it is "incremental" and a low resource consumer.

The idea is to identify a partitioning key by which you can address ranges of records (to cut your working set down), or else to do an initial query that captures the duplicate keys into another table. Then iterate through those keys one small batch at a time: delete, commit, and repeat. If you do it live without a temp table, make sure to keep the ranges small by adding appropriate criteria to reduce the result sets, and keep the cursor or sort area sizes small.

-- Pseudocode
counter = 0;
for each row in dup table       -- and if this takes long, break this into ranges
   delete from primary_tab where id = @id
   if counter++ > 1000 then
      commit;
      counter = 0;
   end if
end loop
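
In SQL Server terms, a rough sketch of that loop might look like the following. It assumes the duplicate IDs have already been captured into a staging table called dup_ids (that table name, the batch size, and the BIGINT key type are all assumptions to adapt, and error handling is omitted):

-- Assumes dup_ids(ID) already holds the IDs of the rows to be removed,
-- e.g. populated using a ROW_NUMBER() query like the one shown above.
DECLARE @batch INT = 1000;                  -- "few" is relative; tune for your system
DECLARE @ids TABLE (ID BIGINT PRIMARY KEY); -- match the real data type of PF_main.ID

WHILE 1 = 1
BEGIN
   DELETE FROM @ids;                        -- clear the previous batch

   BEGIN TRANSACTION;

   -- take one small batch of keys out of the staging table...
   DELETE TOP (@batch) FROM dup_ids
   OUTPUT deleted.ID INTO @ids;

   IF @@ROWCOUNT = 0
   BEGIN
      COMMIT;
      BREAK;                                -- nothing left to delete
   END

   -- ...remove those rows from the main table, then commit the small transaction
   DELETE p
   FROM PF_main AS p
   INNER JOIN @ids AS i ON p.ID = i.ID;

   COMMIT;
END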

This stored proc can be stopped and restarted without worry of a giant rollback, and it will also run reliably for hours or days without major impact on database availability. In Oracle the limiting factors would be things like undo segments and sort area sizes; in MSSQL I am not an expert. Eventually it will finish. In the meantime you aren't tying up the target table with locks or large transactions, so DML can continue against it. The caveat is that if DML does continue, you will potentially have to repeat the snapshot into the dup-IDs table to handle any duplicates that arise after the snapshot.

Caveat: this doesn't defragment free blocks/rows or coalesce deleted space the way completely rebuilding the table would, but it can be done online and without allocating a new copy. On the other hand, if you have the freedom to do it in one operation online, or can get a maintenance window, and the duplicate rows are more than say 15-20% of your data, then you should opt for the "create table as select * from original minus duplicates" approach, as in Gordon's answer, in order to compress the data into a densely utilized, contiguous segment and get better cache/IO performance in the long run. Rarely are duplicates more than a fraction of a percent of the space, however.

Reasons for doing this include:

1 - Table too large to create a temporary de-duped copy.

2 - You can't or don't want to drop the original table to do the swap once the new one is ready.

3 - You can't get a maintenance window to do one giant operation.

Otherwise, see Gordon Linoff's answer.

codenheim

A 2 billion row table is quite big. Let me assume that first, last, and dob constitute a "person". My suggestion is to build an index on the "person" and then do the truncate/re-insert approach.

In practice, this looks like:

create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
                  from pf_main m2
                  where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
                        m2.id < m.id
                 );

truncate table pf_main;

insert into pf_main
    select *
    from temp_pf_main;
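
One assumption worth calling out: if ID happens to be an IDENTITY column (the question doesn't say), the copy-back insert would need an explicit column list and IDENTITY_INSERT turned on, along these lines:

-- only needed if pf_main.ID is an IDENTITY column (an assumption here)
set identity_insert pf_main on;

insert into pf_main (id, first, last, dob, address, city, state, zip, telephone)
    select id, first, last, dob, address, city, state, zip, telephone
    from temp_pf_main;

set identity_insert pf_main off;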
Gordon Linoff
  • +1 for this approach. It will result in a fresh, densely packed data segment as opposed to the delete in place approach. – codenheim Aug 07 '14 at 02:19
  • Awesome Gordon! I will give that a shot, but I'd have to index every field I assume yes? – Sal Aug 07 '14 at 02:47
  • @Sal . . . I was sort of hoping you had a unique identifier for a person. Just building the index may take a long time. – Gordon Linoff Aug 07 '14 at 03:02
  • Several points to consider in this approach: This select-into query could lock the pf_main table preventing inserts for the entire copy operation. You would lose records inserted between the completion of the copy operation and the completion of the copy-back operation. Your pf_main table will also appear to be empty for a period of time. You will generate significant transaction log activity with these copy operations. This option is only available if there are no foreign keys on pf_main. Note that an incremental approach via codenheim will address all of these concerns. – Michael Petito Aug 07 '14 at 12:41
  • @MichaelPetito . . . You should up vote that answer. – Gordon Linoff Aug 07 '14 at 12:58
  • What does it mean when a multi-part identifier cannot be bound? I'm getting that error. – Sal Aug 30 '14 at 00:58
  • @Sal . . . I think it meant that there was a typo in the table alias in the from clause, `m2` should have been `m`. – Gordon Linoff Aug 30 '14 at 01:08

Everybody else has offered great methods for doing this on the technical side; I'll just add one pragmatic point.

IMHO it is difficult to completely automate the process of eliminating the duplicates in a large table of persons. If the matching is too relaxed... then legitimate records will be deleted. If the matching is too strict... then duplicates will be left in.

For my clients I built a query like the above that returns rows that are LIKELY duplicates, using Address and Last Name as the matching criteria. They then eyeball the list of likely matches and click "Delete" or "Merge" on the rows they judge to be duplicates.
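
As an illustration only (using the column names from the question; the matching rule is deliberately loose), such a review query might look like:

-- list LIKELY duplicate groups on last name + address for a human to review
SELECT last, address, COUNT(*) AS possible_copies
FROM PF_main
GROUP BY last, address
HAVING COUNT(*) > 1
ORDER BY possible_copies DESC;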

This might not work in your project (with billions of rows), but it makes sense in an environment where both duplicates and lost data need to be avoided. A single human operator can improve the cleanliness of several thousand rows of data in a few minutes, and they can do it a little at a time over many sessions.

Ty H.

IMHO there's no one best way to dedupe, which is why you see so many different solutions; it depends on your situation. Here is mine: I have a big history file listing the monthly metrics for each of tens of thousands of Loan accounts, going back years and years. Each account is represented in the file at every month-end while it remains active, but once it becomes inactive it never appears again at a later date. I want only the last, or latest, record for each account. I don't care about the record from when an account was opened 20 years ago; all I care about is the latest record, from when the account closed 5 years ago. For accounts that are still active, I want the record for the most recent calendar month. So to me the "duplicates" are all the records for an account except its last monthly record, and I want to get rid of them.

This may not be your exact problem, but the solution I present may give you a boost toward your own.

(I do most of my SQL in SAS PROC SQL, but I think you will get the idea.)

My solution uses a subquery...

 /* dedupe */ 
 proc sql ;   
   create table &delta. as       
     select distinct b.*, sub.mxasof
       from &bravo. b
       join ( select distinct Loan_Number, max(asof) as mxasof format 6.0
                from &bravo.
                group by Loan_Number
            ) sub
       on 1
       and b.Loan_Number = sub.Loan_Number
       and b.asof = sub.mxasof
       where 1
       order by b.Loan_Number, b.asof desc ;
 quit ;
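
For readers on SQL Server, a sketch of the same keep-only-the-latest-row idea in T-SQL (loan_history is an assumed table name; Loan_Number and asof are the columns from the SAS example), using the ROW_NUMBER() technique from the first answer but deleting the older rows in place rather than selecting the survivors into a new table:

-- delete every row except the latest (max asof) row for each Loan_Number
WITH ranked AS
(
   SELECT Loan_Number, asof,
          ROW_NUMBER() OVER (PARTITION BY Loan_Number ORDER BY asof DESC) AS rn
   FROM loan_history
)
DELETE FROM ranked
WHERE rn > 1;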