Working with data coming from Facebook more often than not involves working with records where, in my case, all the “spicy” data is at. However, there is a downside, namely the huge amount of duplicate rows, which when not handled properly can cause over-reporting and/or data discrepancy.
Below is a use case which when joined with my primary data (coming from tables which do not involve any unnesting) causes a slight discrepancy in the final numbers.
Technologies used - Facebook Data -> Stitch -> BigQuery -> dbt -> Google Data Studio
I would usually create separate models where I’d unnest a record, transform the data and then join it into the rest of my models. An example of this is getting all website purchase conversion from the ads_insights’s actions record. Here is the difference though:
Query:
SELECT count(*) AS row_count
FROM ads_insights
Result: row_count - 316
Query:
SELECT count(*) AS row_count
FROM ads_insights,
UNNEST(actions) AS actions
Result: row_count - 5612
After unnesting, I’d use the row data to create columns for each conversion like so:
CASE WHEN value.action_type = 'offsite_conversion.fb_pixel_purchase' THEN COALESCE(value._28d_click, 0) + COALESCE(value._1d_view, 0) ELSE 0 END AS website_purchase
And finally I would join this model to the rest of my models. The only problem is that those 5600 rows cause a slight discrepancy when joined with the rest, and since I’ve already used the row data to create the columns, I don’t care about the unnested record data anymore, and I can go back to my original 316 rows. The only question is how? What techniques are out there that will help me clean up my model?
Solution: Even though at some point I'd aggregate and group all the fields in my query like dylanbaker suggested in his answer, the discrepancy would still persist, and after doing a deep dive at my data I found that the unnested query will return 279 rows, whereas the nested one will return 314. This focused my attention at the unnesting query, where it will remove 35 rows, and those 35 rows happened to be null. After doing some google search I found this StackOverflow article which suggest using LEFT JOIN UNNEST to preserve all rows that have null record values, instead of CROSS JOIN UNNEST which will remove them.