0

Note: I've already gone over related questions like following that don't address my query


I have a sparse, unclean dataset like this

| id  | operation | title    | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start     |          |              |      |
| abc | Start     | recovery |              | Link |
| abc | Start     | recovery | SMS          |      |
| abc | Set       |          | Email        |      |
| abc | Verify    |          | Email        |      |
| pqr | Start     |          |              | OTP  |
| pqr | Verfiy    | sign_in  | Push         |      |
| pqr | Verify    |          |              |      |
| xyz | Start     | sign_up  |              | Link |

and I need to fill up empty rows of each id with non-empty data available from other rows

| id  | operation | title    | channel_type | mode |
|-----|-----------|----------|--------------|------|
| abc | Start     | recovery | SMS          | Link |
| abc | Start     | recovery | SMS          | Link |
| abc | Start     | recovery | SMS          | Link |
| abc | Set       | recovery | Email        | Link |
| abc | Verify    | recovery | Email        | Link |
| pqr | Start     | sign_in  | Push         | OTP  |
| pqr | Verfiy    | sign_in  | Push         | OTP  |
| pqr | Verify    | sign_in  | Push         | OTP  |
| xyz | Start     | sign_up  |              | Link |

notes

  • some ids can have a certain field as empty in all rows
  • and while most ids will have same non-empty values for each field, edge cases could have different values. For such groups, filling up any non-empty value in all rows is acceptable. [this is too rare in my dataset and can be ignored]
  • another extra bit of pattern is that certain fields are mostly only present only against rows of certain operations, for e.g. mode is only present against operation='Start' rows

I've tried grouping rows by id while performing listagg over title, channel_type and mode columns, followed by coalesce, something along the lines of this:

WITH my_data AS (
  SELECT
    id,
    operation,
    title,
    channel_type,
    mode
  FROM
    my_db.my_table
),

list_aggregated_data AS (
  SELECT
    id,
    listagg(title) AS titles,
    listagg(channel_type) AS channel_types,
    listagg(mode) AS modes
  FROM
    my_data
  GROUP BY
    id
),

coalesced_data AS (
  SELECT DISTINCT
    id,
    coalesce(titles) AS title,
    coalesce(channel_types) AS channel_type,
    coalesce(modes) AS mode
  FROM
    list_aggregated_data
),

joined_data AS (
  SELECT
    md.id,
    md.operation,
    cd.title,
    cd.channel_type,
    cd.mode
  FROM
    my_data AS md
  LEFT JOIN
    coalesced_data AS cd ON cd.id = md.id
)

SELECT
  *
FROM
  joined_data
ORDER BY
  id,
  operation

But for some reason this is resulting in concatenation of values (presumably from coalesce operation), where I get

| id  | operation | title            | channel_type | mode |
|-----|-----------|------------------|--------------|------|
| abc | Start     | recoveryrecovery | SMS          | Link |
| abc | Start     | recoveryrecovery | SMS          | Link |
| abc | Start     | recoveryrecovery | SMS          | Link |
| abc | Set       | recoveryrecovery | Email        | Link |
| abc | Verify    | recoveryrecovery | Email        | Link |
| pqr | Start     | sign_in          | Push         | OTP  |
| pqr | Verfiy    | sign_in          | Push         | OTP  |
| pqr | Verify    | sign_in          | Push         | OTP  |
| xyz | Start     | sign_up          |              | Link |

What's the correct way to approach this problem?

y2k-shubham
  • 10,183
  • 11
  • 55
  • 131

1 Answers1

0

I'd start with the first_value() window function with the ignore nulls option. You will partition by the first 2 columns and will need to work out the edge cases with some data massaging, likely in the order by clause of the window function.

Bill Weiner
  • 8,835
  • 2
  • 7
  • 18