0

I think this is a very simple question but I would like some guidance: I didn't want to have to drop a table to send a new table with the deduplicated records, like using DELETE FROM based on the query below using BigQuery, is it possible? PS: This is a nested table!

SELECT
  *
FROM (
  SELECT
    *,
    ROW_NUMBER()
          OVER (PARTITION BY id, date_register) row_number
  FROM
    dataset.table)
WHERE
  row_number = 1 
 order by id, date_register
Felipe FB
  • 1,212
  • 6
  • 22
  • 55
  • 1
    Yes, you sure can, see this questions for more guidance: https://stackoverflow.com/questions/57900777/deduplicate-rows-in-a-bigquery-partition/57900778#57900778 – Ben P Oct 22 '19 at 12:41
  • @BenP In my case I have a row_number that I use to take duplicate and that is not in the table at its source. I did a test initially and he deleted all the records ... =D – Felipe FB Oct 22 '19 at 13:52

2 Answers2

2

To de-duplicate in place, without re-creating the table - use MERGE:

MERGE `temp.many_random` t
USING (
  SELECT DISTINCT *
  FROM `temp.many_random`
)
ON FALSE
WHEN NOT MATCHED BY SOURCE THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

It's simpler than the current accepted answer, as it won't ask you to match the current partitioning or clustering - it will just respect it.

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • @Felipe Hoffa I tried to use this, according to BenP´s comment but I can't make a distinct from the table because it is nested! Yun's option in this case was the one that worked in this case. If you happen to see any way to merge without distinct (and yes with row_number) I add this answer option! – Felipe FB Oct 23 '19 at 11:44
1

Update: please also check Felipe Hoffa's answer which is simpler, and learn more on this post: BigQuery Deduplication.


You need to exclude row_number from output and overwrite your table using CREATE OR REPLACE TABLE:

CREATE OR REPLACE TABLE your_table AS
PARTITION BY DATE(date_register) 
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    ROW_NUMBER()
          OVER (PARTITION BY id, date_register) row_number
  FROM your_table)
WHERE
  row_number = 1 

If you don´t have a partition field defined at the source, I recommend that you create a new table with the partition field to make this query work so that you can automate the process.

Yun Zhang
  • 5,185
  • 2
  • 10
  • 29
  • What I need is to update the current table with the proposed select without having to do a DROP TABLE. As if you were going to do a DELETE FROM – Felipe FB Oct 22 '19 at 16:39
  • the BenP´s comment is as close as I need but I couldn't use row_number to delete the lines I need. – Felipe FB Oct 22 '19 at 16:41
  • @FelipeFB updated my answer with `CREATE OR REPLACE TABLE` – Yun Zhang Oct 22 '19 at 16:46
  • "Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is none and existing spec is interval(type:day)" – Felipe FB Oct 22 '19 at 16:51
  • @FelipeFB Add `PARTITION BY interval` clause before `AS SELECT` – Yun Zhang Oct 22 '19 at 17:02
  • new message: Cannot replace a table with a different partitioning spec. Instead, DROP the table, and then recreate it. New partitioning spec is interval(type:day,field:date_register) and existing spec is interval(type:day) – Felipe FB Oct 22 '19 at 17:08
  • look this: PARTITION BY DATE(_PARTITIONTIME): Partitions the table using the date-based timestamp in the _PARTITIONTIME pseudo column. This syntax is only supported with CREATE TABLE without the AS query_statement clause. #OMG – Felipe FB Oct 22 '19 at 17:19
  • I see the problem is you are dealing with an ingestion time paritioned table. – Yun Zhang Oct 22 '19 at 17:23
  • @FelipeFB, could you consider to make your source table a date-partitioned table instead of an ingestion time partitioned table? If this can be done, the query in this answer will just work. – Yun Zhang Oct 22 '19 at 17:45
  • Yes, I will consider this change! Thank you very much. – Felipe FB Oct 22 '19 at 19:29