My table is something like:

CREATE TABLE table1
(
    _id         text,
    name        text,
    data_type   int,
    data_value  int,
    data_date   timestamp -- insertion time
);

Now, due to a system bug, many duplicate entries were created, and I need to remove those duplicates and keep only one row per unique combination of columns, excluding data_date because it is a system-generated timestamp.

My query to do that is something like:

DELETE FROM table1 A
USING ( SELECT _id, name, data_type, data_value, MIN(data_date) min_date
        FROM table1
        GROUP BY _id, name, data_type, data_value
        HAVING count(data_date) > 1) B
WHERE A._id = B._id
AND A.name = B.name
AND A.data_type = B.data_type
AND A.data_value = B.data_value
AND A.data_date != B.min_date;

This query works; however, with millions of records in the table, I want a faster way to do it. My idea is to create a new column whose value is a rank partitioned by [_id, name, data_type, data_value], i.e. the columns in my GROUP BY. However, I could not find a way to create such a column. I would appreciate it if anyone could suggest one.

Edit 1: One more thing to add: I don't want to use a CTE or a subquery to populate this new column, because that would be the same as my existing query.
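
For context, the closest I could get is something like the following, but it still needs a subquery to populate the column (PostgreSQL does not allow window functions directly in an UPDATE), which is exactly what I want to avoid (the column name rn is just an example):

ALTER TABLE table1 ADD COLUMN rn integer;

UPDATE table1 t
SET rn = s.rn
FROM ( SELECT ctid,
              ROW_NUMBER() OVER (PARTITION BY _id, name, data_type, data_value
                                 ORDER BY data_date) rn
       FROM table1 ) s
WHERE t.ctid = s.ctid;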

Harsh Shankar

1 Answer


The best way is simply to create a new table without the duplicate records:

CREATE TABLE table1_dedup AS  -- the new table name is illustrative
SELECT _id, name, data_type, data_value,
       MIN(data_date) AS data_date  -- keep the earliest timestamp, under the original column name
FROM table1
GROUP BY _id, name, data_type, data_value;
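
If you go this route, you would then swap the new table in. A minimal sketch, assuming the new table is named table1_dedup as above and that any indexes and constraints on table1 are recreated separately:

BEGIN;
DROP TABLE table1;
ALTER TABLE table1_dedup RENAME TO table1;
COMMIT;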

Alternatively, you can compute a rank and then filter on it, but a subquery is needed:

RANK() OVER (PARTITION BY _id, name, data_type, data_value ORDER BY data_date ASC) AS r

And then keep only the rows where r = 1, deleting the rest.
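
A sketch of that approach as a single DELETE against this table; it uses PostgreSQL's ctid system column to target physical rows, and ROW_NUMBER() instead of RANK() so that rows with identical data_date values still get distinct numbers:

DELETE FROM table1 a
USING ( SELECT ctid,
               ROW_NUMBER() OVER (PARTITION BY _id, name, data_type, data_value
                                  ORDER BY data_date ASC) r
        FROM table1 ) b
WHERE a.ctid = b.ctid
AND b.r > 1; -- keeps the earliest row per group (r = 1)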

ecp
  • Having a duplicate table with millions of rows doesn't seem to be a good idea. Creating it will take a lot of time; I actually tried that and it took around 25 minutes or so, and the server was almost dead for that time. I can't block my production server for that long. – Harsh Shankar Dec 30 '19 at 11:26
  • Take a look: https://stackoverflow.com/questions/6583916/delete-duplicate-records-in-postgresql – ecp Dec 30 '19 at 11:48
  • Alternatively, you could speed up your query. – ecp Dec 30 '19 at 12:00
  • I would want to speed it up, but I'm not sure how to do that. – Harsh Shankar Dec 30 '19 at 12:01
  • Maybe use DISTINCT ON (sketched below), or joins with HASH(), or even an ANTI JOIN. – ecp Dec 30 '19 at 12:04
  • Then maybe the best thing to do is to try to debug the duplicates at their origin. – ecp Dec 30 '19 at 12:55
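
For reference, a minimal sketch of the DISTINCT ON idea from the comments, again writing the surviving rows to a new table (the name table1_dedup is illustrative):

CREATE TABLE table1_dedup AS
SELECT DISTINCT ON (_id, name, data_type, data_value)
       _id, name, data_type, data_value, data_date
FROM table1
ORDER BY _id, name, data_type, data_value, data_date ASC; -- keeps the earliest row per group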