2

I have two tables where one is holding "raw" data and another is holding "updated" data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement for this data to be stored separately.

I want a query with the following conditions:

  • Select all rows from the first table
  • If there is a matching row in the second table (ie. when raw_d.primary_key_col_1 = edit_d.primary_key_col_1 and raw_d.primary_key_col_2 = edit_d.primary_key_col_2), we use the most recent (where most recent is based on column primary_key_col_3 values from the second table, rather than the first
  • Otherwise we use the values from the first table.

Note: I have many more "value" columns in the actual data. Considering the following toy example where I have two tables, raw_d and edit_d, that are quite similar as follows:

    primary_key_col_1    |    primary_key_col_2    |    value_col_1    |    value_col_2
-------------------------+-------------------------+-------------------+-------------------
           src_1         |         dest_1          |         0         |         1
           src_2         |         dest_2          |         5         |         4
           src_3         |         dest_3          |         2         |         2
           src_4         |         dest_4          |         6         |         3
           src_5         |         dest_5          |         9         |         9

    primary_key_col_1    |    primary_key_col_2    |    primary_key_col_3    |    value_col_1    |    value_col_2
-------------------------+-------------------------+-------------------------+---------------------------------------
           src_1         |         dest_1          |       2020-05-09        |         7         |         0
           src_2         |         dest_2          |       2020-05-08        |         6         |         1
           src_3         |         dest_3          |       2020-05-07        |         5         |         2
           src_1         |         dest_1          |       2020-05-08        |         3         |         4
           src_2         |         dest_2          |       2020-05-09        |         2         |         5

The expected result is as given:

    primary_key_col_1    |    primary_key_col_2    |    value_col_1    |    value_col_2
-------------------------+-------------------------+-------------------+-------------------
           src_1         |         dest_1          |         7         |         0
           src_2         |         dest_2          |         2         |         5
           src_3         |         dest_3          |         5         |         2
           src_4         |         dest_4          |         6         |         3
           src_5         |         dest_5          |         9         |         9

My proposed solution is to query the "greatest n per group" with the second table and then "overwrite" rows in a query of the first table, using Pandas.

The first query would just grab data from the first table:

SELECT * FROM raw_d

The second query to select "the greatest n per group" would be as follows:

SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d
ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;

I planned on merging the data like in Replace column values based on another dataframe python pandas - better way?.

Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.

2 Answers2

1

As I understood from your question, there are 2 ways to solve this

1. Using FULL OUTER JOIN

with cte as (
   select distinct on (primary_key_col_1,primary_key_col_2) * from edit_d 
   order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)

select 
coalesce(t1.primary_key_col_1,t2.primary_key_col_1),
coalesce(t1.primary_key_col_2,t2.primary_key_col_2),
coalesce(t1.value_col_1,t2.value_col_1),
coalesce(t1.value_col_2,t2.value_col_2)
from cte t1 
full outer join raw_d t2 
on t1.primary_key_col_1 = t2.primary_key_col_1 
and t1.primary_key_col_2 = t2.primary_key_col_2

DEMO

2. Using Union

select  
distinct on (primary_key_col_1, primary_key_col_2) 
primary_key_col_1, primary_key_col_2, value_col_1, value_col_2 

from (
  select * from edit_d 
  union all
  select primary_key_col_1,primary_key_col_2, null as "primary_key_col_3", 
  value_col_1,value_col_2 from raw_d
  order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc nulls last
  )tab

DEMO

Akhilesh Mishra
  • 5,876
  • 3
  • 16
  • 32
1

I would suggest phrasing the requirements as:

  • select the most recent row from the second table
  • bring in additional rows from the first table that don't match

This is a union all with distinct on:

(select distinct on (primary_key_col_1, primary_key_col_2) u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
 from updated u
 order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
) union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw r
where not exists (select 1
                  from updated u
                  where u.primary_key_col_1 = r.primary_key_col_2 and
                        u.primary_key_col_2 = r.primary_key_col_2
                 );
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786