1

I am using postgresql. To delete duplicates in a table I use this query:

DELETE FROM dups a USING (
      SELECT MIN(ctid) as ctid, key
        FROM dups 
        GROUP BY key HAVING COUNT(*) > 1
      ) b
      WHERE a.key = b.key 
      AND a.ctid <> b.ctid

Reference: https://stackoverflow.com/a/12963112/4940278

However, there is a table say ref_table where the dups.id are referenced as well. I need to update the other table before deleting the duplicates. What is the query to update the reference table with the duplicate's id, so that there is no data loss?

For example:

Table 1, say dups

id key
1  Luna
2  Hermione
3  Luna

Table 2, say ref_table

id dups_id data
1   2     Auror
2   1     Researcher

Now the query to delete the duplicates will delete the record with id 1 in dups table as it is a duplicate. However that record is referenced in the ref_table, so I need to update that with the record that is going to be retained.

i.e) Table1 should become:

id key
2  Hermione
3  Luna

and Table 2 should become:

id dups_id data
1   2     Auror
2   3     Researcher
Surya
  • 2,429
  • 1
  • 21
  • 42

2 Answers2

1

Use a CTE to identify the rows maintained in dups, then update the ref rows so that the FK point only to them, finally removed the rows no longer desired.

with keeper as                                     -- identify dups rows to be kept
     ( select id, key
            , max(id) over(partition by key) mid 
         from dups)

   , u as                                          -- update ref so dup_id references only those being kept
     ( update ref r 
        set dup_id =  (select k.mid 
                         from keeper k
                         join dups   d 
                           on (k.id=d.id)
                        where r.dup_id != k.mid
                          and r.dup_id = k.id
                      )
     )   

delete                                             -- final target remove dups rows no lnger needed
  from dups d                                            
 where d.id not in (select mid from keeper);    
Belayer
  • 13,578
  • 2
  • 11
  • 22
  • Shouldn't the join condition here be `k.keyid=d.key` ? – Surya Sep 08 '19 at 01:30
  • @Surya, no. First off k.keyid doesn't exist, but I assume that's a typo and you meant k.key=d.key. Still no; doing so results in "ERROR: more than one row returned by a subquery". The Keeper CTE contains 2 ids - the value needing Updated, and the value to update to,and evntually keep - during the delete. Actually the column Key is not needed in the keeper CTE at all - seems it's a remnant from testing. Disclaimer: For full discloser I defined both id columns as PK and ref.dups_id as FK. Also added a 3rd row for Luna and another set of dups for testing. – Belayer Sep 08 '19 at 18:59
  • Thank you for your explanation. I still don't understand though. I don't yet know what `partition by` does. But I will read the docs and find out how this query works. Your query works. But unfortunately, my production data is huge has `152670` records. The table is not indexed as well. So the query is running forever. So I have decided to update and delete the records using a ruby script. – Surya Sep 09 '19 at 06:14
  • @Surya, The Partition clause applies to Window functions and provides the same functionality for window function that Group By provides for the query as a whole. BTW a table with 152000+ rows is not huge, it actually approaches small. You'll be better off creating a PK (that will create an index). Good Luck to you. – Belayer Sep 09 '19 at 07:41
  • If by PK you mean PrimaryKey I do have PrimaryKey the id is the Primary Key. But I thought it would have been faster if I had indexed on `key`, maybe I should index on `key` and try this query again. Thanks anyways. – Surya Sep 09 '19 at 08:14
  • It took close to 15-20min to update and then delete. – Surya Sep 09 '19 at 08:15
0

You can use a CTE:

with d as (
      DELETE FROM dups a USING
             (SELECT MIN(ctid) as ctid, key
              FROM dups 
              GROUP BY key HAVING COUNT(*) > 1
             ) b
      WHERE a.key = b.key AND a.ctid <> b.ctid
      RETURNING *
     )
update othertable
   set . . .
   from d
   where . . .;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • ` DELETE FROM dups a USING (SELECT MIN(ctid) as ctid, key FROM dups GROUP BY key HAVING COUNT(*) > 1 ) b WHERE a.key = b.key AND a.ctid <> b.ctid RETURNING *` This gives me the deleted records, but I need to access the records that are retained inorder to update the other table. How can I do so? More clarity please – Surya Sep 06 '19 at 17:12
  • @Surya . . . I don't understand. If you need to access the records that are *retained*, just query the table after the `delete`. – Gordon Linoff Sep 06 '19 at 19:11
  • Have updated the question with examples for better clarity, please check @Gordon – Surya Sep 07 '19 at 15:21