
I'm used to de-duplicating on all sorts of databases with a single command, generally something like this:

DELETE
FROM
  table AS original
USING
(
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY col1, col2, col3, ...
    ) AS rn
  FROM
    table
) AS other
WHERE
  rn > 1 AND
  original.col1 = other.col1 AND
  original.col2 = other.col2 AND
  original.col3 = other.col3 AND ...
;

This will only delete the duplicates, leaving the first occurrences of each row behind, which is what I expect.
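
(One side note: since there is no ORDER BY inside that window, "first occurrence" just means whichever row happens to get rn = 1. If the choice needs to be deterministic, the window can be ordered on a tie-breaking column; created_at below is a hypothetical name, just to show the shape.)

ROW_NUMBER() OVER (
  PARTITION BY col1, col2, col3, ...
  ORDER BY created_at  -- hypothetical tie-breaker: keeps the oldest row as rn = 1
) AS rn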

I tried to replicate that on BigQuery, and the closest I could get was with MERGE, using a similar statement:

MERGE
  `table` orig
USING
(
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY col1, col2, col3, ...
    ) AS rn
  FROM
    `table`
) AS other
ON
  (orig.col1 = other.col1 OR (orig.col1 IS NULL AND other.col1 IS NULL)) AND
  (orig.col2 = other.col2 OR (orig.col2 IS NULL AND other.col2 IS NULL)) AND
  (orig.col3 = other.col3 OR (orig.col3 IS NULL AND other.col3 IS NULL)) AND ...
WHEN MATCHED AND other.rn > 1 THEN
  DELETE
;

This kind of works, except that it deletes all rows that have duplicates, including the first occurrences. I suspect that's because of how BigQuery matches rows in a MERGE: the ON clause only compares the key columns, so once a source row with rn > 1 exists for a given key, every target row with that key matches it and gets deleted, first occurrence included. But I need the first occurrences to be kept. Any ideas?
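
To make the behavior concrete, here is a minimal sketch with a made-up table and values (mydataset.demo and its col1/col2 columns are assumptions, not my real schema): both copies of the duplicated row match the rn = 2 source row, so WHEN MATCHED ... THEN DELETE removes both of them.

-- Hypothetical demo table with one duplicated row
CREATE OR REPLACE TABLE `mydataset.demo` AS
SELECT 1 AS col1, 'a' AS col2 UNION ALL
SELECT 1, 'a' UNION ALL
SELECT 2, 'b';

MERGE `mydataset.demo` orig
USING
(
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY col1, col2) AS rn
  FROM
    `mydataset.demo`
) AS other
ON
  orig.col1 = other.col1 AND
  orig.col2 = other.col2
WHEN MATCHED AND other.rn > 1 THEN
  DELETE
;

-- Only (2, 'b') survives: each copy of (1, 'a') matched the rn = 2 source row
SELECT * FROM `mydataset.demo`;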

  • Consider keeping your original table as a source and use a view (or your favorite ETL tool) to de-dupe and give you a working table. – rtenha Jul 03 '19 at 16:17
  • That would cause me to de-dupe hundreds of millions to dozens of billions of rows per table (and I have around a hundred tables). Not a good option for me. – Ricardo Panaggio Jul 03 '19 at 17:06
  • Have you tried these methods : https://stackoverflow.com/questions/36675521/delete-duplicate-rows-from-a-bigquery-table ? Basically, instead of deleting duplicate rows, you create a new table with only rows that you want and overwrite the original table. – Yun Zhang Jul 03 '19 at 17:41
  • Again, we're talking about billions of rows for around a hundred tables. Re-creating everything is going to be much more expensive than just removing duplicates. And this needs to be done every day, so recreating everything daily will make the price skyrocket. – Ricardo Panaggio Jul 03 '19 at 17:49
  • Deleting rows every day will be expensive as well. Can you modify your import process to de-dupe before you load to BQ? Also, is your table partitioned? That would save you from having to do the whole table every day...you would just de-dupe the most recent partition. – rtenha Jul 03 '19 at 18:50
  • We have around 0.01% of duplicates generated, so deleting them isn't that expensive (compared to the total volume). Right now I need to fix this on my side of things, and later the data engineers who take care of the pipeline will try to de-dupe before the data arrives in BQ. – Ricardo Panaggio Jul 03 '19 at 19:59
  • Yes, it's partitioned, but if I replace the table, I'd have to re-generate the entire table. The idea is to run the in-place de-dupe once a day on the previous day's data until the de-duplication moves to the beginning of the pipeline (which might take some time); a sketch of what that per-partition pass could look like is below. – Ricardo Panaggio Jul 03 '19 at 20:05
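
For reference, a rough sketch of what that once-a-day, previous-partition pass could look like, assuming the table is partitioned on a DATE column (event_date is a hypothetical name; col1 to col3 stand in for the real key columns). It narrows the scan to one partition, but it still has the same problem described in the question: the first occurrences get deleted too.

MERGE `table` orig
USING
(
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY col1, col2, col3
    ) AS rn
  FROM
    `table`
  WHERE
    event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)  -- yesterday's partition only
) AS other
ON
  orig.event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AND
  orig.col1 = other.col1 AND
  orig.col2 = other.col2 AND
  orig.col3 = other.col3
WHEN MATCHED AND other.rn > 1 THEN
  DELETE
;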

0 Answers