I'm used to de-duplicating on all sorts of databases using a single command, generally something like this:
DELETE FROM table AS original
USING (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY col1, col2, col3, ...
        ) AS rn
    FROM table
) AS other
WHERE
    other.rn > 1 AND
    original.col1 = other.col1 AND
    original.col2 = other.col2 AND
    original.col3 = other.col3 AND ... AND
    -- a row-identity match (the primary key, or ctid on Postgres) is needed on
    -- top of the column equalities; without it, the first occurrence of each
    -- group matches its own duplicates and gets deleted too
    original.id = other.id
;
This will only delete the duplicates, leaving the first occurrences of each row behind, which is what I expect.
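For example, on Postgres the same idea looks like this (the table and column names here are made up for illustration; ctid is Postgres's built-in physical row identifier):

```sql
-- Hypothetical table: events(user_id, event_type, created_at).
DELETE FROM events AS original
USING (
    SELECT
        ctid,  -- Postgres-specific row identity; a primary key works elsewhere
        ROW_NUMBER() OVER (
            PARTITION BY user_id, event_type, created_at
        ) AS rn
    FROM events
) AS other
WHERE
    original.ctid = other.ctid
    AND other.rn > 1;
```

Matching on the row's own identity, rather than only on the partition columns, is what guarantees that exactly the rows numbered 2 and up are deleted, keeping one row per group.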
I tried to replicate that on BigQuery, and the closest I could get was with MERGE and a similar statement:
MERGE `table` AS orig
USING (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY col1, col2, col3, ...
        ) AS rn
    FROM `table`
) AS other
ON
    (orig.col1 = other.col1 OR (orig.col1 IS NULL AND other.col1 IS NULL)) AND
    (orig.col2 = other.col2 OR (orig.col2 IS NULL AND other.col2 IS NULL)) AND
    (orig.col3 = other.col3 OR (orig.col3 IS NULL AND other.col3 IS NULL)) AND ...
WHEN MATCHED AND other.rn > 1 THEN
    DELETE
;
This almost works, but it deletes all rows that have duplicates, including the first occurrences. I suspect it's down to how BigQuery handles DELETE in a MERGE: presumably a target row is deleted when it matches any source row that satisfies the WHEN clause, and the first occurrence of each group matches its own duplicates on those columns. Either way, I need the first occurrences to be kept. Any ideas?
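To make the behavior concrete, here's a toy version of what I'm seeing (the dataset, table, and data are made up):

```sql
-- Hypothetical table `demo.t` with a single column col1 holding
-- three rows: 'a', 'a', 'b'. The subquery numbers them
-- rn = 1, 2 within the 'a' group and rn = 1 within the 'b' group.
MERGE `demo.t` AS orig
USING (
    SELECT
        col1,
        ROW_NUMBER() OVER (PARTITION BY col1) AS rn
    FROM `demo.t`
) AS other
ON orig.col1 = other.col1
WHEN MATCHED AND other.rn > 1 THEN
    DELETE;
-- Both 'a' target rows match the source row with rn = 2 (they have the
-- same col1 value), so both get deleted; only 'b' survives. I expected
-- one 'a' and one 'b' to remain.
```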