Deduplicate rows in complex schema in a bigquery partition

Question

I have read some threads but I know too little sql to solve my problem. I have a table with a complex schema with records and nested fields.

Below you see a query which finds the exact row that I need to deduplicate.

SELECT * 
FROM my-data-project-214805.rfid_data.rfid_data_table 
WHERE DATE(_PARTITIONTIME) = "2020-02-07"  
AND DetectorDataMessage.Header.MessageID ='478993053'

DetectorDataMessage.Header.MessageID is supposed to be unique.

How can I delete one of these rows? (there are two)

If possible I would like deduplicate the whole table but its partitioned and I can't get it right. I try the suggestions in below threads but I get this error Column DetectorDataMessage of type STRUCT cannot be used in...

Threads of interest: Deduplicate rows in a BigQuery partition Delete duplicate rows from a BigQuery table

Any suggestions? Can you guide me in the right direction?

score 1 · Answer 1 · answered Feb 10 '20 at 19:58

Try using a MERGE to remove the existing duplicate rows, and a single identical one. In this case I'm going for a specific date and id, as in the question:

MERGE `temp.many_random` t
USING (
  # choose a single row to replace the duplicates
  SELECT a.*
  FROM (
    SELECT ANY_VALUE(a) a
    FROM `temp.many_random` a
    WHERE DATE(_PARTITIONTIME)='2018-10-01'
    AND DetectorDataMessage.Header.MessageID ='478993053'
    GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID 
  )
)
ON FALSE
WHEN NOT MATCHED BY SOURCE 
  # delete the duplicates
  AND DATE(_PARTITIONTIME)='2018-10-01'  
  AND DetectorDataMessage.Header.MessageID ='478993053'
THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

Based on this answer:

Deduplicate rows in a BigQuery partition

Buena compadre. Gracias. Unfortunately I get this error: "Omitting INSERT target column list is unsupported for ingestion-time partitioned table ....." — i_am_cris, Feb 11 '20 at 07:17

Julia · Answer 2 · 2020-02-10T20:49:16.917

0

If all of the values in the duplicate rows are the same, just use 'SELECT distinct'. If not, I would use the ROW_NUMBER() function to create a rank for each unique index, and then just choose the first rank.

I don't know what your columns are, but here's an example:

WITH subquery as
(select MessageId
ROW_NUMBER() OVER(partition by MessageID order by MessageId ASC) AS rank
)
select *
from subquery
where rank = 1

edited Feb 10 '20 at 20:49

answered Feb 10 '20 at 19:38

Julia

1
2

Thanks for helping. This works but it queries the distinct rows. I want to delete all duplicates from the table. The table have a lot of nested fields and is partitioned. – i_am_cris Feb 11 '20 at 07:24

Deduplicate rows in complex schema in a bigquery partition

2 Answers2