Remove duplicates from table based on multiple criteria and persist to other table

Question

I have a taccounts table with columns like account_id(PK), login_name, password, last_login. Now I have to remove some duplicate entries according to a new business logic. So, duplicate accounts will be with either same email or same (login_name & password). The account with the latest login must be preserved.

Here are my attempts (some email values are null and blank)

DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 
GROUP BY lower(trim(both ' ' from email)))

Similarly for login_name and password

DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)

Is there any better way or any way to combine these two separate queries?

Also some other table have account_id as foreign key. How to update this change for those tables?` I am using PostgreSQL 9.2.1

EDIT: Some of the email values are null and some of them are blank(''). So, If two accounts have different login_name & password and their emails are null or blank, then they must be considered as two different accounts.

What is the roughly estimated percentage of duplicates according to the new rules? Is `account_id` the primary key? What version of PostgreSQL? Can you afford to lock the table for some time (no concurrent access)? — Erwin Brandstetter, Mar 30 '13 at 11:04
@ErwinBrandstetter There are about 4500 rows and about 430 of them are unique. Yes `account_id` is the primary key. Postgre ver 1.16.0. Locking is actually not required as I am working on some migration thing — Anupam, Mar 30 '13 at 11:07
Please run `SELECT version()` in your database. 1.16 is probably the version of pgAdmin which you seem to be using. And **edit** your question with the additional (essential) information. — Erwin Brandstetter, Mar 30 '13 at 11:10
@ErwinBrandstetter I have edited the question with Pk and version info — Anupam, Mar 30 '13 at 11:15
More detail: can `last_login` happen to be the same for a pair of dupes? And can `login_name` and `password` also be `NULL`. And how to deal with that? — Erwin Brandstetter, Mar 30 '13 at 12:32
wrt the NULL values in emails: don't use an `IN select MAX())` subquery; `IN` behaves terrible when confronted with NULL values. Use `EXISTS(...)` instead. — wildplasser, Mar 30 '13 at 12:55
BTW: plsql is the procedural language of Oracle. Fixed the tags. — Erwin Brandstetter, Mar 30 '13 at 14:35
same `last_login` for two dupes is very rare but possible. No, `login_name` and `password` can never be null or empty — Anupam, Mar 30 '13 at 18:24

Erwin Brandstetter · Accepted Answer · 2021-05-15T23:07:00.303

If most of the rows are deleted (mostly dupes) and the table fits into RAM, consider this route:

SELECT surviving rows into a temporary table.
Reroute FK references to survivors
DELETE all rows from the base table.
Re-INSERT survivors.

1a. Distill surviving rows

CREATE TEMP TABLE tmp AS
SELECT DISTINCT ON (login_name, password) *
FROM  (
   SELECT DISTINCT ON (email) *
   FROM   taccounts
   ORDER  BY email, last_login DESC
   ) sub
ORDER  BY login_name, password, last_login DESC;

About DISTINCT ON:

Select first row in each GROUP BY group?

To identify duplicates for two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so this is "serializable".

Inspect results and test for plausibility.

SELECT * FROM tmp;

Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using) the session lives as long as the editor window is open.

1b. Alternative query for updated definition of "duplicates"

SELECT *
FROM   taccounts t
WHERE  NOT EXISTS (
   SELECT  FROM taccounts t1
   WHERE  ( NULLIF(t1.email, '') = t.email
        OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
   AND   (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
   );

This doesn't treat NULL or empty string ('') as identical in any of the "duplicate" columns.

The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes could share the same last_login. The one with the bigger account_id is chosen in this case - which is unique, since it is the PK.

2a. How to identify all incoming FKs

SELECT c.confrelid::regclass::text AS referenced_table
     , c.conname AS fk_name
     , pg_get_constraintdef(c.oid) AS fk_definition
FROM   pg_attribute a 
JOIN   pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE  c.confrelid = 'taccounts'::regclass   -- (schema-qualified) table name
AND    c.contype  = 'f'
ORDER  BY 1, contype DESC;

Only building on the first column of the foreign key. More about that:

Find the referenced table name using table, field and schema name

Or inspect the Dependents rider in the right hand window of the object browser of pgAdmin after selecting the table taccounts.

2b. Reroute to new primary

If you have tables referencing taccounts (incoming foreign keys to taccounts) you will want to update all those fields, before you delete the dupes.
Reroute all of them to the new primary row:

UPDATE referencing_tbl r
SET    referencing_column = tmp.reference_column
FROM   tmp
JOIN   taccounts t1 USING (email)
WHERE  r.referencing_column = t1.referencing_column
AND    referencing_column IS DISTINCT FROM tmp.reference_column;

UPDATE referencing_tbl r
SET    referencing_column = tmp.reference_column
FROM   tmp
JOIN   taccounts t2 USING (login_name, password)
WHERE  r.referencing_column = t1.referencing_column
AND    referencing_column IS DISTINCT FROM tmp.reference_column;

3. & 4. Go in for the kill

Now, dupes are not referenced any more. Go in for the kill.

ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;

Disable all triggers for the duration of the operation. This avoids checking for referential integrity during the operation. Everything should be fine once you re-activate triggers. We took care of all incoming FKs above. Outgoing FKs are guaranteed to be sound, since you have no concurrent write access and all values have been there before.

Why not `TRUNCATE TABLE taccounts;` instead of `DELETE` + `VACUUM` ? — Ihor Romanchenko, Mar 30 '13 at 11:19
+ some of the references will be broken with `DISABLE TRIGGER ALL` — Ihor Romanchenko, Mar 30 '13 at 11:21
Interesting. Pardon me if I am wrong but isn't this will eliminate accounts with blank or null emails with a single account? If two accounts have different login_name and password and email is blank or null they should be different accounts — Anupam, Mar 30 '13 at 11:25
@IgorRomanchenko: since there are only 4500 rows, the performance benefit from `TRUNCATE` is marginal or non-existing, since `DELETE` is generally faster for small tables. Also, `TRUNCATE` is more invasive. — Erwin Brandstetter, Mar 30 '13 at 11:25
@anu: The comments baited me to go all out and provide a comprehensive recipe. Enjoy. — Erwin Brandstetter, Mar 30 '13 at 12:05
@anu: about your comment - again, you want to clarify **in your question** the definition of what exactly is a duplicate in your case. `DISTINCT` treats `NULL` values as identical in any case. — Erwin Brandstetter, Mar 30 '13 at 12:18
@ErwinBrandstetter That is a great well explained answer. I have edited my question at the end about my concerns with the email value. What do you think about that? — Anupam, Mar 30 '13 at 12:32
@anu: I added a query to cover all contingencies. (Also fixed the WHERE condition now). — Erwin Brandstetter, Mar 30 '13 at 13:09
Thanks for such a great well referenced answer. I wish I could give more than 1 vote — Anupam, Mar 31 '13 at 08:04

score 1 · Answer 2 · edited May 15 '21 at 22:46

In addition to Erwin's excellent answer, it can often be useful to create in intermediate link-table that relates the old keys with the new ones.

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE taccounts
        ( account_id SERIAL PRIMARY KEY
        , login_name varchar
        , email varchar
        , last_login TIMESTAMP
        );
    -- create some fake data
INSERT INTO taccounts(last_login)
SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs
        ;
UPDATE taccounts
SET login_name = 'User_' || (account_id %10)::text
        , email = 'Joe' || (account_id %9)::text || '@somedomain.tld'
        ;

SELECT * FROM taccounts;

        --
        -- Create (temp) table linking old id <--> new id
        -- After inspection this table can be used as a source for the FK updates
        -- and for the final delete.
        --
CREATE TABLE update_ids AS
WITH pairs AS (
        SELECT one.account_id AS old_id
        , two.account_id AS new_id
        FROM taccounts one
        JOIN taccounts two ON two.last_login > one.last_login
                AND ( two.email = one.email OR two.login_name = one.login_name)
        )
SELECT old_id,new_id
FROM pairs pp
WHERE NOT EXISTS (
        SELECT * FROM pairs nx
        WHERE nx.old_id = pp.old_id
        AND nx.new_id > pp.new_id
        )
        ;

SELECT * FROM update_ids
        ;

UPDATE other_table_with_fk_to_taccounts dst
SET account_id. = ids.new_id
FROM update_ids ids
WHERE account_id. = ids.old_id
        ;
DELETE FROM taccounts del
WHERE EXISTS (
        SELECT * FROM update_ids ex
        WHERE ex.old_id = del.account_id
        );

SELECT * FROM taccounts;

Yet another way to accomplish the same is to add a column with a pointer to the preferred key to the table itself and use that for your updates and deletes.

ALTER TABLE taccounts
        ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id)
        ;

   -- find the *better* records for each record.
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.last_login > dst.last_login
AND src.email IS NOT NULL
AND NOT EXISTS (
        SELECT * FROM taccounts nx
        WHERE nx.login_name = dst.login_name
        AND nx.email IS NOT NULL
        AND nx.last_login > src.last_login
        );

    -- Find records that *do* have an email address
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.email IS NOT NULL
AND dst.email IS NULL
AND NOT EXISTS (
        SELECT * FROM taccounts nx
        WHERE nx.login_name = dst.login_name
        AND nx.email IS NOT NULL
        AND nx.last_login > src.last_login
        );

SELECT * FROM taccounts ORDER BY account_id;

UPDATE other_table_with_fk_to_taccounts dst
SET account_id = src.better_id
FROM update_ids src
WHERE dst.account_id = src.account_id
AND src.better_id IS NOT NULL
        ;

DELETE FROM taccounts del
WHERE EXISTS (
        SELECT * FROM taccounts ex
        WHERE ex.account_id = del.better_id
        );
SELECT * FROM taccounts ORDER BY account_id;

Great..One problem though. May be I was not clear in my question, if email matches then I dont have to worry about matching the login_name ,password. Similarly if login_name and password matches then email can be null or blank or whatever — Anupam, Mar 30 '13 at 18:18
You could add an extra update statement for the case where email matches, but name differs. (there is a semantic problem here: a row could have more than one replacement candidate) The advantage of these methods is that you can inspect the result before you press the big red button... The disadvantage is that the real data can change wrt the temp table. — wildplasser, Mar 30 '13 at 18:32