Remove duplicates from a table and re-link referencing rows to the new master

Question

I have a table transcription which contains passages of transcribed text and their citations with columns:

text, transcription_id(PK), t_notes, citation

and the second table town_transcription being the relationship table that links places (from another table) referenced in the text to that transcription record. This table has the columns:

town_id(FK), transcription_id(FK), confidence_interval

Many of these passages of text reference multiple towns, but stupidly I just duplicated records and linked them individually to each town. I have identified the duplicate rows of text using the following SQL query:

SELECT * FROM transcription aa
WHERE (select count(*) from transcription bb
WHERE (bb.text = aa.text) AND (bb.citation = aa.citation)) > 1
ORDER BY text ASC;

I now have about 2000 rows (2 to 6 duplicates of some text passages) where I need to delete the extra transcription_id's from the transcription table and change the transcription_id from the relationship table, town_transcription, to point to the remaining, now unique, transcription record. From reading other questions, I think utilizing UPDATE FROM and INNER JOIN might be necessary, but I really don't know how to implement this, I'm just a beginner, thanks for any help.

Erwin Brandstetter · Answer 1 · 2018-11-19T00:36:01.793

1

This single command should do it all:

WITH blacklist AS (  -- identify duplicate IDs and their master
   SELECT *
   FROM  (
      SELECT transcription_id
           , min(transcription_id) OVER (PARTITION BY text, citation) AS master_id
      FROM   transcription
      ) sub
   WHERE  transcription_id <> master_id
   )
, upd AS (  -- redirect referencing rows
   UPDATE town_transcription tt
   SET    transcription_id = b.master_id
   FROM   blacklist b
   WHERE  b.transcription_id = tt.transcription_id
   )
DELETE FROM transcription t  -- kill dupes (now without reference)
USING  blacklist b
WHERE  b.transcription_id = t.transcription_id;

For lack of definition I chose the row with the smallest ID per group as surviving master row.

FK constraints don't get in the way unless you have non-default settings. Detailed explanation:

After removing the dupes you might now want to add a UNIQUE constraint to prevent the same error from reoccurring:

ALTER TABLE transcription
ADD CONSTRAINT transcription_uni UNIQUE (text, citation);

edited Nov 19 '18 at 00:36

answered Nov 19 '18 at 00:23

Erwin Brandstetter

605,456
145
1,078
1,228

What is the significance of the 'sub' at the end of the first FROM clause? – C. Smith Nov 21 '18 at 07:21
Also, the query results in an ERROR: update or delete on table "transcription" violates foreign key constraint "town_transcription_transcription_id_fkey" on table "town_transcription" DETAIL: Key (transcription_id)=(4652) is still referenced from table "town_transcription". – C. Smith Nov 21 '18 at 10:59
And when I make an exception in the query for that record, it just counts down to the next one and says the same for 4651, etc. Do I need to blacklist the combination of `town_id` and `transcription_id` since they are a compound key? – C. Smith Nov 21 '18 at 11:05
@C.Smith: `sub` is the alias for the table derived in subquery. Every table requires a name. – Erwin Brandstetter Nov 21 '18 at 18:26
@C.Smith: You never mentioned a *compound key*. I assumed an FK to some `town` table for `town_id`. (Error msg seems unrelated either way.) Avoid misunderstandings with `CREATE TABLE` statements (with just relevant columns) showing actual data types and constraints instead of the less helpful pseudo-code you provided. You can only delete rows from `transcription` after all references have been removed. Are you sure, my `blacklist` with `PARTITION BY text, citation` covers *all* instances? Maybe dupes with empty string (`''`) vs NULL or typos slipping through? Or unrelated refs? – Erwin Brandstetter Nov 21 '18 at 18:32
Yes, the `town_id` column is from a separate `town` town table, I just happened to make the combination of it and `transcription_id` a compound primary key in the relationship table, my apologies for the lack of clarity. And regarding the `blacklist` covering all instances, after getting the error, I tried `PARTITION BY text, citation, ref_date, t_notes` afterwards to reduce the possibility of something slipping through. Same error. Some `citation` values are NULL. – C. Smith Nov 22 '18 at 20:22
If you provide a test setup in the question with actual `CREATE TABLE` statements and a couple of sample rows, I'll have another look. Maybe a [fiddle](https://dbfiddle.uk/) to play with ... And *always* disclose your version of Postgres. `SELECT version()` helps. – Erwin Brandstetter Nov 22 '18 at 23:45

Paul Maxwell · Answer 2 · 2018-11-19T00:33:02.550

Use row_number() over(...) to identify rows that repeat information. A partition by text, citation in the over clause will force the row number series to re-start at 1 for each unique set of those values:

select
     *
from (
       select
              text, transcription_id, t_notes, citation
            , row_number() over(partition by text, citation 
                                order by transcription_id) as rn
       from transcription 
     ) d
where rn > 1

Once you have verified those as the unwanted rows,then use the same logic for a delete statement.

However, you may loose information held in the t_notes column - are you willing to do that?

Remove duplicates from a table and re-link referencing rows to the new master

2 Answers2

Linked