0

I am trying to delete duplicated rows from a table but seems that is not working.

The query

DELETE FROM notifications N1
USING notifications N2
WHERE N1.id < N2.id
      AND N1.user_id = N2.user_id
      AND N1.notification_type = N2.notification_type;

Some SGDBs says that there is no relation to notifications table. If I run in another app it takes like 1hr to delete them but looks like in the end it never commit and goes back to the same state before. The database is an AWS RDS instance and the table contains more than 40 million rows, and I expect to have less than 1 mi in the end.

Assumptions:

  • There is a Postgres lock for huge delete query.
  • Because of the size of the table, it drops the connection and never commits.
  • Because of the size and the table usability, I'll have to add a lock to the query.
  • Maybe I'll have to duplicate the table, make the delete and then replace the old table with the new table.

I am not sure what approach I must follow.

  • 1
    Instead of making a self join, calculate the ROW_NUMBER partitioned by user_id, notification_type and delete everything with a number>1. In both cases, the user_id,notification_type and id have to be covered by indexes. In the self join case though, missing indexes could result in 40M table scans – Panagiotis Kanavos Mar 14 '18 at 14:10
  • Can you posted the error message or your exception trace? – buqing Mar 14 '18 at 14:10
  • @Robin the query isn't failing, it's too slow. The execution plan would probably show some pretty big table scans – Panagiotis Kanavos Mar 14 '18 at 14:11
  • 2
    Did you try to rewrite that as an `EXISTS` query instead of the join? Or any other of the solutions suggested [here](https://stackoverflow.com/a/3777663/330315) or [here](https://stackoverflow.com/questions/1746213/how-to-delete-duplicate-entries) or [here](https://stackoverflow.com/q/6583916/330315) –  Mar 14 '18 at 14:14
  • I am going to try it, but I'll test on a local first. I'll post the results here. – Thiago Augustus Oliveira Mar 14 '18 at 14:26
  • The error is under @imsop answer – Thiago Augustus Oliveira Mar 14 '18 at 15:00

1 Answers1

1

It's hard to give a definitive answer without the full data to test with, but you might find it helps to break the query into two parts: determining the rows to delete, and deleting them.

CREATE TEMPORARY TABLE temp_notifications_to_delete (
    id Int
)

INSERT INTO temp_notifications_to_delete ( id )
SELECT N1.id
FROM notifications N1
JOIN notifications N2
ON N1.id < N2.id
      AND N1.user_id = N2.user_id
      AND N1.notification_type = N2.notification_type;

DELETE FROM notifications N
USING temp_notifications_to_delete D
WHERE N.id = D.id

Separating it this way prevents the table having to be locked while the rows to delete are calculated. It also makes it easier to test other improvements to the SELECT part, such as adding relevant indexes, and using alternative forms of the query.

As mentioned by a_horse_with_no_name, you could use an EXISTS:

INSERT INTO temp_notifications_to_delete ( id )
SELECT N1.id
FROM notifications N1
WHERE EXISTS (
   SELECT * 
   FROM notifications N2
      WHERE N1.id < N2.id
      AND N1.user_id = N2.user_id
      AND N1.notification_type = N2.notification_type
);

If you're OK with deleting one duplicate for each user_id, notification_type pair at a time, you could use a GROUP BY:

INSERT INTO temp_notifications_to_delete ( id )
SELECT MIN(id)
FROM notifications
GROUP BY user_id, notification_type
HAVING COUNT(*) > 1;

The suggestion from Panagiotis Kanavos to use window functions also sounds promising, although I don't know the exact query they had in mind.

IMSoP
  • 89,526
  • 13
  • 117
  • 169
  • Tried to run the query to insert the ids into the temp table and I am getting this message `ERROR: relation "notifications" does not exist`. Do you know what it can be? – Thiago Augustus Oliveira Mar 14 '18 at 14:59
  • @ThiagoAugustusOliveira "relation" just means "table"; it's telling you your table doesn't exist. I copied the table name from your question. – IMSoP Mar 14 '18 at 17:05
  • I found a possible issue. I wasn't referencing the right schema. Thanks. Still running the query so I am not 100% if this is the answer to relation issue. – Thiago Augustus Oliveira Mar 14 '18 at 17:14