
Current Table

select * from PG_TABLE_DEF where tablename='currenttable';

 schemaname |  tablename   |        column         |         type          | encoding | distkey | sortkey | notnull
------------+--------------+-----------------------+-----------------------+----------+---------+---------+---------
 public     | currenttable | kafkaoffset           | integer               | az64     | f       |       0 | t
 public     | currenttable | operation             | character varying(25) | lzo      | f       |       0 | t
 public     | currenttable | othertablepk          | integer               | az64     | f       |       0 | t
 public     | currenttable | othertableorderstatus | character varying(25) | lzo      | f       |       0 | t

select * from currentTable;

| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:------------:|----------------------:|
| 1024        | CREATE    | 23           | Cooking               |
| 1025        | UPDATE    | 23           | Shipped               |
| 1026        | UPDATE    | 23           | Delivered             |
| 1027        | CREATE    | 51           | Cooking               |
| 1028        | UPDATE    | 51           | Shipped               |
| 1029        | CREATE    | 52           | Cooking               |

I want to dedupe my current table, keeping only the latest record (by kafkaOffset) for each otherTablePK.

Deduped Table (Expected Result)

select * from currentTable;
| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:------------:|----------------------:|
| 1026        | UPDATE    | 23           | Delivered             |
| 1028        | UPDATE    | 51           | Shipped               |
| 1029        | CREATE    | 52           | Cooking               |

Solution-1: Using INNER JOIN and max()

A MySQL-like query using an inner join and max(). Note that the DELETE ... INNER JOIN form below is MySQL syntax and is not valid in Redshift as written; a Redshift-compatible rewrite is sketched after the query.

DELETE currentTable
FROM currentTable
INNER JOIN
  (SELECT max(kafkaOffset) AS lastOffset,
          otherTablePk AS otherTablePkID
   FROM currentTable
   WHERE otherTablePk IN
       (SELECT otherTablePk
        FROM currentTable
        GROUP BY otherTablePk
        HAVING count(*) > 1)
   GROUP BY otherTablePk) lastTable ON lastTable.otherTablePkID = currentTable.otherTablePk
WHERE currentTable.kafkaOffset < lastTable.lastOffset;
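
Redshift does not support DELETE with a JOIN clause, though; its closest equivalent is DELETE ... USING. A minimal sketch of the same idea in Redshift syntax, materializing the per-key max into a temp table first (lastOffsets is an illustrative name; untested):

CREATE TEMP TABLE lastOffsets AS
SELECT otherTablePk,
       max(kafkaOffset) AS lastOffset
FROM currentTable
GROUP BY otherTablePk;

DELETE FROM currentTable
USING lastOffsets
WHERE currentTable.otherTablePk = lastOffsets.otherTablePk
  AND currentTable.kafkaOffset < lastOffsets.lastOffset;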

Solution-2: Using USING and a self join

DELETE t1
FROM currentTable t1
JOIN currentTable t2 USING (otherTablePK)
WHERE t1.kafkaOffset < t2.kafkaOffset;
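
This is again MySQL syntax. In Redshift the same self join could be expressed with DELETE ... USING, where the target table keeps its own name (a sketch, untested):

DELETE FROM currentTable
USING currentTable t2
WHERE currentTable.otherTablePK = t2.otherTablePK
  AND currentTable.kafkaOffset < t2.kafkaOffset;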

Solution-3: Using a TEMP table and surgical deletes

As explained in this blog and this answer, but the use case is a little different here: we need to delete everything but keep the latest record, and using max() makes the query slow. A minimal sketch of this approach follows.
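
The sketch swaps max() for row_number() and rebuilds the table rather than deleting from it (latestRows is an illustrative name; untested):

CREATE TEMP TABLE latestRows AS
SELECT kafkaOffset, operation, otherTablePK, otherTableOrderStatus
FROM (SELECT *,
             row_number() OVER (PARTITION BY otherTablePK
                                ORDER BY kafkaOffset DESC) AS rn
      FROM currentTable) t
WHERE rn = 1;

TRUNCATE currentTable;
INSERT INTO currentTable SELECT * FROM latestRows;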

All the solutions above would be slow in Redshift, since it is a columnar store. What would be the fastest way to do this operation in Redshift?

Alok Kumar Singh

1 Answer


Please include the table DDL. Otherwise it's all speculation.

Speaking of speculation, please try this untested query (I don't have Redshift at hand):

DELETE FROM currentTable
WHERE kafkaOffset IN (
    SELECT kafkaOffset
    FROM (
        SELECT kafkaOffset
             , row_number() OVER (PARTITION BY otherTablePK ORDER BY kafkaOffset DESC) rn
        FROM currentTable
    ) t
    WHERE rn > 1  -- rn = 1 is the latest record per otherTablePK; delete the rest
);
Jonathan Jacobson
  • Posted the table DDL. kafkaOffset is the primary key and otherTablePK is the primary key in another table. This is a temp table which I want to dedupe before finally inserting into the otherTable. – Alok Kumar Singh Aug 31 '20 at 07:51
  • @AlokKumarSingh then don't delete at all! Just select the rows that you want. – Jonathan Jacobson Aug 31 '20 at 08:00
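
For completeness, a sketch of the select-only approach from the last comment, assuming otherTable can accept these columns directly (an assumption, since its DDL is not shown; untested):

-- Skip the delete entirely: insert only the latest row per otherTablePK
INSERT INTO otherTable
SELECT kafkaOffset, operation, otherTablePK, otherTableOrderStatus
FROM (SELECT *,
             row_number() OVER (PARTITION BY otherTablePK
                                ORDER BY kafkaOffset DESC) AS rn
      FROM currentTable) t
WHERE rn = 1;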