Current Table
select * from currentTable;
select * from PG_TABLE_DEF where tablename='currenttable';
schemaname | tablename | column | type | encoding | distkey | sortkey | notnull
------------+--------------+-----------------------+-----------------------+----------+---------+---------+---------
public | currenttable | kafkaoffset | integer | az64 | f | 0 | t
public | currenttable | operation | character varying(25) | lzo | f | 0 | t
public | currenttable | othertablepk | integer | az64 | f | 0 | t
public | currenttable | othertableorderstatus | character varying(25) | lzo | f | 0 | t
| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:-------------:|-----------------------:
| 1024 | CREATE | 23 | Cooking
| 1025 | UPDATE | 23 | Shipped
| 1026 | UPDATE | 23 | Delivered
| 1027 | CREATE | 51 | Cooking
| 1028 | UPDATE | 51 | Shipped
| 1029 | CREATE | 52 | Cooking
I want to dedupe my current table to keep only the latest record(by kafkaOffset
) based on the otherTablePk
.
Deduped Table (Expected Result)
select * from currentTable;
| kafkaOffset | operation | otherTablePK | otherTableOrderStatus |
|:------------|----------:|:-------------:|-----------------------:
| 1026 | UPDATE | 23 | Delivered
| 1028 | UPDATE | 51 | Shipped
| 1029 | CREATE | 52 | Cooking
Solution-1: USING InnerJoin
and max
MYSQL like query in redshift using inner join
and max
. More Info.
DELETE
FROM currentTable
INNER JOIN
(SELECT max(kafkaOffset) AS lastOffset,
otherTablePk AS otherTablePkID
FROM currentTable
WHERE otherTablePkID IN
(SELECT otherTablePk
FROM currentTable
GROUP BY otherTablePk
HAVING count(*) > 1)
GROUP BY otherTablePk) lastTable ON lastTable.otherTablePkID = currentTable.otherTablePkID
WHERE current_table.kafkaOffset < lastTable.lastOffset;
Solution-2: Using USING
and doing Self Join.
DELETE from currentTable t1
JOIN currentTable t2 USING (otherTablePK)
WHERE t1.kafkaOffset < t2.kafkaOffset
Solution-3: Using TEMP
table and surgical deletes
As explained in this blog and this answer, but the use case is little different here. We need to delete everything but keep the latest., doing max makes the query slow.
All the solutions above would be slow in Redshift, it being a columnar storage. Please suggest what would be the fastest way to do this operation in Redshift?