
Current Table

select * from PG_TABLE_DEF where tablename='currenttable';

 schemaname |  tablename   |        column         |         type          | encoding | distkey | sortkey | notnull
------------+--------------+-----------------------+-----------------------+----------+---------+---------+---------
 public     | currenttable | kafkaoffset           | integer               | az64     | f       |       0 | t
 public     | currenttable | operation             | character varying(25) | lzo      | f       |       0 | t
 public     | currenttable | othertablepk          | integer               | az64     | f       |       0 | t
 public     | currenttable | othertableorderstatus | character varying(25) | lzo      | f       |       0 | t

select * from currentTable;

| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:------------:|----------------------:|
| 1024        | CREATE    | 23           | Cooking               |
| 1025        | UPDATE    | 23           | Shipped               |
| 1026        | UPDATE    | 23           | Delivered             |
| 1027        | CREATE    | 51           | Cooking               |
| 1028        | UPDATE    | 51           | Shipped               |
| 1029        | CREATE    | 52           | Cooking               |

I want to dedupe my current table, keeping only the latest record (by kafkaOffset) for each otherTablePK.

Deduped Table (Expected Result)

select * from currentTable;
| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:------------:|----------------------:|
| 1026        | UPDATE    | 23           | Delivered             |
| 1028        | UPDATE    | 51           | Shipped               |
| 1029        | CREATE    | 52           | Cooking               |

Solution-1: Using INNER JOIN and max()

A MySQL-like query using an inner join and max(). Note that the DELETE ... INNER JOIN form below is MySQL syntax and is not valid in Redshift as written; a Redshift-compatible rewrite is sketched after the query.

DELETE currentTable
FROM currentTable
INNER JOIN
  (SELECT max(kafkaOffset) AS lastOffset,
          otherTablePk AS otherTablePkID
   FROM currentTable
   WHERE otherTablePk IN
       (SELECT otherTablePk
        FROM currentTable
        GROUP BY otherTablePk
        HAVING count(*) > 1)
   GROUP BY otherTablePk) lastTable ON lastTable.otherTablePkID = currentTable.otherTablePk
WHERE currentTable.kafkaOffset < lastTable.lastOffset;
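
Redshift does not support DELETE with a JOIN clause, though; its closest equivalent is DELETE ... USING. A minimal sketch of the same idea in Redshift syntax, materializing the per-key max into a temp table first (lastOffsets is an illustrative name; untested):

CREATE TEMP TABLE lastOffsets AS
SELECT otherTablePk,
       max(kafkaOffset) AS lastOffset
FROM currentTable
GROUP BY otherTablePk;

DELETE FROM currentTable
USING lastOffsets
WHERE currentTable.otherTablePk = lastOffsets.otherTablePk
  AND currentTable.kafkaOffset < lastOffsets.lastOffset;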

Solution-2: Using USING and a self join

DELETE t1
FROM currentTable t1
JOIN currentTable t2 USING (otherTablePK)
WHERE t1.kafkaOffset < t2.kafkaOffset;
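
This is again MySQL syntax. In Redshift the same self join could be expressed with DELETE ... USING, where the target table keeps its own name (a sketch, untested):

DELETE FROM currentTable
USING currentTable t2
WHERE currentTable.otherTablePK = t2.otherTablePK
  AND currentTable.kafkaOffset < t2.kafkaOffset;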

Solution-3: Using a TEMP table and surgical deletes

As explained in this blog and this answer, but the use case is a little different here: we need to delete everything but keep the latest record, and using max() makes the query slow. A minimal sketch of this approach follows.
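
The sketch swaps max() for row_number() and rebuilds the table rather than deleting from it (latestRows is an illustrative name; untested):

CREATE TEMP TABLE latestRows AS
SELECT kafkaOffset, operation, otherTablePK, otherTableOrderStatus
FROM (SELECT *,
             row_number() OVER (PARTITION BY otherTablePK
                                ORDER BY kafkaOffset DESC) AS rn
      FROM currentTable) t
WHERE rn = 1;

TRUNCATE currentTable;
INSERT INTO currentTable SELECT * FROM latestRows;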

All the solutions above would be slow in Redshift, since it is a columnar store. What would be the fastest way to do this operation in Redshift?

Alok Kumar Singh

1 Answer


Please include the table DDL. Otherwise it's all speculation.

Speaking of speculation, please try this untested query (I don't have Redshift at hand):

DELETE FROM currentTable
WHERE kafkaOffset IN (
    SELECT kafkaOffset
    FROM (
        SELECT kafkaOffset
             , row_number() OVER (PARTITION BY otherTablePK ORDER BY kafkaOffset DESC) rn
        FROM currentTable
    ) t
    WHERE rn > 1  -- rn = 1 is the latest record per otherTablePK; delete the rest
);
Jonathan Jacobson
  • Posted the table DDL. kafkaOffset is the primary key and otherTablePK is the primary key in another table. This is a temp table which I want to dedupe before finally inserting into the otherTable. – Alok Kumar Singh Aug 31 '20 at 07:51
  • @AlokKumarSingh then don't delete at all! Just select the rows that you want. – Jonathan Jacobson Aug 31 '20 at 08:00
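
For completeness, a sketch of the select-only approach from the last comment, assuming otherTable can accept these columns directly (an assumption, since its DDL is not shown; untested):

-- Skip the delete entirely: insert only the latest row per otherTablePK
INSERT INTO otherTable
SELECT kafkaOffset, operation, otherTablePK, otherTableOrderStatus
FROM (SELECT *,
             row_number() OVER (PARTITION BY otherTablePK
                                ORDER BY kafkaOffset DESC) AS rn
      FROM currentTable) t
WHERE rn = 1;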