0

This is unlike the selecting row with latest timestamp question and is specific to Redshift

I want to allow users to update parts of a (staging) table row at different points in time while avoiding invoking UPDATE statements. This is done by an append-only approach where we keep adding rows where only the unique id and a timestamp is mandatory and the other columns may or may not have a value provided.

Question:

Given a table where apart from "primary key" (not truely enforced) and a timestamp column, all other columns in that table are nullable, how do I merge all rows that have the same primary key into one row by picking the most recent non-null value for each nullable columns, if one such non-null value exists.

Example:

|id|timestamp|status|stringcol|numcol|
|1 |456      |begin |         |      |
|1 |460      |      |         |  2   |
|2 |523      |      |  foo    |      |
|1 |599      |mid   |  blah   |      |
|2 |624      |begin |         |      |
|1 |721      |done  |         |  60  |

should produce

|id|timestamp|status|stringcol|numcol|
|2 |624      |begin |  foo    |      |
|1 |721      |done  |  blah   |  60  |
Lim
  • 329
  • 2
  • 15
  • There is a `DISTINCT ON` feature in PostgreSQL that does exactly this! However, it is not available in Amazon Redshift. Here is something that claims to mimic that behaviour: [DISTINCT ON like functionality for Redshift](https://gist.github.com/jmindek/62c50dd766556b7b16d6) – John Rotenstein May 06 '19 at 21:10

1 Answers1

1

This can be achieved using Redshift's LISTAGG function combined with SPLIT_PART function.

Using the above example 5-column table, you would need something like this:

SELECT id, 
       MAX(last_updated),
       SPLIT_PART(LISTAGG(status, ',') WITHIN GROUP(ORDER BY last_updated DESC), ',', 1),
       SPLIT_PART(LISTAGG(stringcol, ',') WITHIN GROUP(ORDER BY last_updated DESC), ',', 1),
       SPLIT_PART(LISTAGG(numcol, ',') WITHIN GROUP(ORDER BY last_updated DESC), ',', 1)
FROM table
GROUP BY 1;
Lim
  • 329
  • 2
  • 15