I have several BigQuery partitioned tables that contain lots of duplicate rows, and I need to remove the duplicates while keeping exactly one row each. I used the solution below and it works for most of my tables. Two of my tables are really huge (about 15,000,000,000 rows) and the query fails because the table is too large to process in one shot.
I was thinking of filtering by partition, but if I add a WHERE clause such as WHERE DATE(datestamp) > '2023-04-01', then any data before this date will be lost, because CREATE OR REPLACE rewrites the whole table with only the rows the SELECT returns.
```sql
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
-- keep one row per duplicate group (no ORDER BY, so which row
-- survives is nondeterministic)
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;
```
A solution that is not ideal, but that I think would work: instead of overwriting the main table in one query, create 3-4 staging tables, each deduplicating a different partition-time range, and then UNION ALL those tables back into one. That way each query scans only about 30% of the data. Downside: lots of manual work.
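The split-and-union idea above could be sketched roughly as follows. The staging table names (`table_part1`, etc.) and the date boundaries are placeholders I made up for illustration; adjust them to your own partition ranges:

```sql
-- Stage 1: dedupe each date range into its own staging table,
-- so each query scans only part of the source data.
CREATE OR REPLACE TABLE `project.dataset.table_part1`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
WHERE DATE(datestamp) < '2022-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;

CREATE OR REPLACE TABLE `project.dataset.table_part2`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
WHERE DATE(datestamp) >= '2022-01-01' AND DATE(datestamp) < '2023-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;

CREATE OR REPLACE TABLE `project.dataset.table_part3`
PARTITION BY DATE(datestamp) AS
SELECT *
FROM `project.dataset.table`
WHERE DATE(datestamp) >= '2023-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY id, CAST(value AS STRING), names, parameter) = 1;

-- Stage 2: reassemble the deduplicated ranges into the main table.
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY DATE(datestamp) AS
SELECT * FROM `project.dataset.table_part1`
UNION ALL
SELECT * FROM `project.dataset.table_part2`
UNION ALL
SELECT * FROM `project.dataset.table_part3`;
```

One caveat: the dedup key (id, value, names, parameter) does not include datestamp, so if two duplicate rows carry different datestamps that fall into different ranges, each staging table keeps its own copy and the UNION ALL reintroduces the duplicate. This sketch is only safe if duplicates always share the same date range; otherwise a final dedup pass over the reassembled table would still be needed.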