I have several BigQuery partitioned tables that contain lots of duplicate rows, and I need to remove the duplicates while keeping exactly one row each. I used the solution below and it works for most of my tables. Two of my tables are really huge (about 15,000,000,000 rows) and the query fails because the table is too large to process in one shot.
I was thinking of filtering by partition, but if I add a WHERE clause such as WHERE DATE(datestamp) > '2023-04-01', then any data before this date will be lost, because CREATE OR REPLACE rewrites the whole table with only the rows the SELECT returns.
```sql
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
-- keep one row per duplicate group (no ORDER BY, so which row
-- survives is nondeterministic)
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;
```
A solution that is not ideal, but that I think would work: instead of overwriting the main table in one query, create 3-4 staging tables, each deduplicating a different partition-time range, and then UNION ALL those tables back into one. That way each query scans only about 30% of the data. Downside: lots of manual work.
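The split-and-union idea above could be sketched roughly as follows. The staging table names (`table_part1`, etc.) and the date boundaries are placeholders I made up for illustration; adjust them to your own partition ranges:

```sql
-- Stage 1: dedupe each date range into its own staging table,
-- so each query scans only part of the source data.
CREATE OR REPLACE TABLE `project.dataset.table_part1`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
WHERE DATE(datestamp) < '2022-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;

CREATE OR REPLACE TABLE `project.dataset.table_part2`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
WHERE DATE(datestamp) >= '2022-01-01' AND DATE(datestamp) < '2023-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;

CREATE OR REPLACE TABLE `project.dataset.table_part3`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
WHERE DATE(datestamp) >= '2023-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;

-- Stage 2: reassemble the deduplicated ranges into the main table.
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(datestamp) AS
SELECT * FROM `project.dataset.table_part1`
UNION ALL
SELECT * FROM `project.dataset.table_part2`
UNION ALL
SELECT * FROM `project.dataset.table_part3`;
```

One caveat: the dedup key (id, value, names, parameter) does not include datestamp, so if two duplicate rows carry different datestamps that fall into different ranges, each staging table keeps its own copy and the UNION ALL reintroduces the duplicate. This sketch is only safe if duplicates always share the same date range; otherwise a final dedup pass over the reassembled table would still be needed.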