
Working with data coming from Facebook more often than not involves working with nested records, which in my case is where all the “spicy” data lives. There is a downside, though, namely the huge number of duplicate rows, which, when not handled properly, can cause over-reporting and/or data discrepancies.

Below is a use case which, when joined with my primary data (coming from tables that do not involve any unnesting), causes a slight discrepancy in the final numbers.

Technologies used: Facebook Data -> Stitch -> BigQuery -> dbt -> Google Data Studio

I would usually create separate models where I’d unnest a record, transform the data, and then join it into the rest of my models. An example of this is getting all website purchase conversions from the ads_insights table’s actions record.
Here is the difference though:


Query:

SELECT count(*) AS row_count
FROM ads_insights

Result:
 row_count - 316



Query:

SELECT count(*) AS row_count
FROM ads_insights,
UNNEST(actions) AS actions

Result:
 row_count - 5612


After unnesting, I’d use the row data to create columns for each conversion like so:

CASE
  WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase'
  THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0)
  ELSE 0
END AS website_purchase


And finally I would join this model to the rest of my models. The only problem is that those ~5,600 rows cause a slight discrepancy when joined with the rest, and since I’ve already used the row data to create the columns, I don’t care about the unnested record data anymore and can go back to my original 316 rows. The only question is how? What techniques are out there that will help me clean up my model?

Solution: Even though at some point I'd aggregate and group all the fields in my query, as dylanbaker suggested in his answer, the discrepancy would still persist. After a deep dive into my data I found that the unnested query returns 279 rows, whereas the nested one returns 314. That focused my attention on the unnesting step, which removes 35 rows, and those 35 rows happened to have a null actions record. After some googling I found a StackOverflow question which suggests using LEFT JOIN UNNEST, which preserves rows whose record values are null, instead of CROSS JOIN UNNEST, which removes them.
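A minimal sketch of the fix described above. The table and record names come from the question; aliasing the unnested element as value matches the CASE expression earlier, but is an assumption about the schema. Note that in BigQuery a correlated LEFT JOIN against UNNEST needs an explicit ON TRUE:

```sql
-- CROSS JOIN UNNEST (the comma syntax) drops the 35 rows whose
-- actions record is NULL:
--   FROM ads_insights, UNNEST(actions) AS value

-- LEFT JOIN UNNEST keeps those rows, with the record fields as NULL:
SELECT
    ads_insights.*,          -- the original row survives even with no actions
    value.action_type
FROM ads_insights
LEFT JOIN UNNEST(actions) AS value ON TRUE
```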

Angel Dimov

2 Answers


You would typically want to do a 'pivot' here. You're most of the way there; you just need to sum and group by the relevant columns in order to get this back to the grain that you originally had and want.

I believe you'll want something like this:

select
    ads_insights.some_column,
    ads_insights.some_other_column,
    sum(case 
          when value.action_type = 'offsite_conversion.fb_pixel_purchase' 
          then coalesce(value._28d_click, 0) + coalesce(value._1d_view, 0) 
          else 0 
        end) AS website_purchase
from ads_insights,
unnest(actions) as actions
group by 1,2

The initial columns would be whatever you want from the original table. The 'sum case whens' would be to pivot and aggregate the unnested data.

dylanbaker
  • This is what I thought as well, but grouping the non-aggregated fields just won't cut it, since the unnesting had already caused havoc in the rest of the data (e.g. columns like spend). I'd use your code as a CTE called "actions" where I'd do the pivoting, and another CTE called "insights" where I'd get the ad_spend for "some_column" and "some_other_column" without doing any unnesting. The first CTE (CTE1) would return 279 rows, whereas the second (CTE2) returns 314. When I left join CTE1 with CTE2, I end up with 279 rows. And when I FULL JOIN, I end up with 314 rows, but 35 of them have null col1 and col2. – Angel Dimov Sep 18 '20 at 14:46
  • Okay, I figured out that the 35 missing records have no actions; thus, when I unnest(actions), they are removed from the table because they are null. After a quick google search I found this StackOverflow article, which suggests using LEFT JOIN UNNEST(record) instead of CROSS JOIN UNNEST(record), which will keep all null values as well. And this turned out to be the last piece of the puzzle! Article link: https://stackoverflow.com/questions/44918108/google-bigquery-i-lost-null-row-when-using-unnest-function – Angel Dimov Sep 18 '20 at 15:02
  • In relation to your first comment, why doesn't grouping by spend work? Isn't the spend the same for all of the relevant actions? – dylanbaker Sep 18 '20 at 16:00
  • You are completely right that I can group by ad_spend as well and skip the 2 extra CTEs! – Angel Dimov Sep 18 '20 at 20:03
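Putting the two pieces from this discussion together (the sum/case pivot from the answer, LEFT JOIN UNNEST from the comments, and grouping by spend instead of using extra CTEs), the final model might look roughly like this. some_column, some_other_column, and spend are placeholder names from the thread, not the actual schema, and the value alias is an assumption matching the question's CASE expression:

```sql
SELECT
    ads_insights.some_column,
    ads_insights.some_other_column,
    ads_insights.spend,
    SUM(CASE
          WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase'
          THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0)
          ELSE 0
        END) AS website_purchase
FROM ads_insights
LEFT JOIN UNNEST(actions) AS value ON TRUE  -- keeps the 35 rows with null actions
GROUP BY 1, 2, 3
```

Because every source row survives the LEFT JOIN, the grouped result should land back on the original row count rather than dropping the action-less rows.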

You can actually do some magic with unnests inside the select statement.

Does this work for you?

SELECT
    some_column,
    (SELECT coalesce(_28d_click, 0) + coalesce(_1d_view, 0) from unnest(actions) WHERE action_type = "offsite_conversion.fb_pixel_purchase") AS website_purchase
FROM ads_insights
Thomas
  • I was really excited when I first learned about scalar subqueries, but my excitement was short-lived because BigQuery won't allow me to run more than 6-7 at a time, and in my case we are talking about pivoting tens of fields at the same time, so I went back to my original solution. I just had to change CROSS JOIN UNNEST to LEFT JOIN UNNEST to keep all nulls and avoid the data discrepancy. – Angel Dimov Sep 23 '20 at 14:09