Note: I've already gone over related questions like following that don't address my query
- SQL: how to pick one row for each set of rows with duplicate value in one column?
- Fill missing values with first non-null following value in Redshift
I have a sparse, unclean dataset like this
| id | operation | title | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start | | | |
| abc | Start | recovery | | Link |
| abc | Start | recovery | SMS | |
| abc | Set | | Email | |
| abc | Verify | | Email | |
| pqr | Start | | | OTP |
| pqr | Verfiy | sign_in | Push | |
| pqr | Verify | | | |
| xyz | Start | sign_up | | Link |
and I need to fill up empty rows of each id
with non-empty data available from other rows
| id | operation | title | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start | recovery | SMS | Link |
| abc | Start | recovery | SMS | Link |
| abc | Start | recovery | SMS | Link |
| abc | Set | recovery | Email | Link |
| abc | Verify | recovery | Email | Link |
| pqr | Start | sign_in | Push | OTP |
| pqr | Verfiy | sign_in | Push | OTP |
| pqr | Verify | sign_in | Push | OTP |
| xyz | Start | sign_up | | Link |
notes
- some
id
s can have a certain field as empty in all rows - and while most ids will have same non-empty values for each field, edge cases could have different values. For such groups, filling up any non-empty value in all rows is acceptable. [this is too rare in my dataset and can be ignored]
- another extra bit of pattern is that certain fields are mostly only present only against rows of certain
operation
s, for e.g.mode
is only present againstoperation='Start'
rows
I've tried grouping rows by id
while performing listagg
over title
, channel_type
and mode
columns, followed by coalesce
, something along the lines of this:
WITH my_data AS (
SELECT
id,
operation,
title,
channel_type,
mode
FROM
my_db.my_table
),
list_aggregated_data AS (
SELECT
id,
listagg(title) AS titles,
listagg(channel_type) AS channel_types,
listagg(mode) AS modes
FROM
my_data
GROUP BY
id
),
coalesced_data AS (
SELECT DISTINCT
id,
coalesce(titles) AS title,
coalesce(channel_types) AS channel_type,
coalesce(modes) AS mode
FROM
list_aggregated_data
),
joined_data AS (
SELECT
md.id,
md.operation,
cd.title,
cd.channel_type,
cd.mode
FROM
my_data AS md
LEFT JOIN
coalesced_data AS cd ON cd.id = md.id
)
SELECT
*
FROM
joined_data
ORDER BY
id,
operation
But for some reason this is resulting in concatenation of values (presumably from coalesce
operation), where I get
| id | operation | title | channel_type | mode |
|-----|-----------|------------------|--------------|------|
| abc | Start | recoveryrecovery | SMS | Link |
| abc | Start | recoveryrecovery | SMS | Link |
| abc | Start | recoveryrecovery | SMS | Link |
| abc | Set | recoveryrecovery | Email | Link |
| abc | Verify | recoveryrecovery | Email | Link |
| pqr | Start | sign_in | Push | OTP |
| pqr | Verfiy | sign_in | Push | OTP |
| pqr | Verify | sign_in | Push | OTP |
| xyz | Start | sign_up | | Link |
What's the correct way to approach this problem?