0

I have read some threads but I know too little sql to solve my problem. I have a table with a complex schema with records and nested fields.

Below you see a query which finds the exact row that I need to deduplicate.

SELECT * 
FROM my-data-project-214805.rfid_data.rfid_data_table 
WHERE DATE(_PARTITIONTIME) = "2020-02-07"  
AND DetectorDataMessage.Header.MessageID ='478993053'

DetectorDataMessage.Header.MessageID is supposed to be unique.

How can I delete one of these rows? (there are two)

If possible I would like deduplicate the whole table but its partitioned and I can't get it right. I try the suggestions in below threads but I get this error Column DetectorDataMessage of type STRUCT cannot be used in...

Threads of interest: Deduplicate rows in a BigQuery partition Delete duplicate rows from a BigQuery table

Any suggestions? Can you guide me in the right direction?

i_am_cris
  • 557
  • 1
  • 5
  • 19

2 Answers2

1

Try using a MERGE to remove the existing duplicate rows, and a single identical one. In this case I'm going for a specific date and id, as in the question:

MERGE `temp.many_random` t
USING (
  # choose a single row to replace the duplicates
  SELECT a.*
  FROM (
    SELECT ANY_VALUE(a) a
    FROM `temp.many_random` a
    WHERE DATE(_PARTITIONTIME)='2018-10-01'
    AND DetectorDataMessage.Header.MessageID ='478993053'
    GROUP BY _PARTITIONTIME, DetectorDataMessage.Header.MessageID 
  )
)
ON FALSE
WHEN NOT MATCHED BY SOURCE 
  # delete the duplicates
  AND DATE(_PARTITIONTIME)='2018-10-01'  
  AND DetectorDataMessage.Header.MessageID ='478993053'
THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW

Based on this answer:

Felipe Hoffa
  • 54,922
  • 16
  • 151
  • 325
  • Buena compadre. Gracias. Unfortunately I get this error: "Omitting INSERT target column list is unsupported for ingestion-time partitioned table ....." – i_am_cris Feb 11 '20 at 07:17
0

If all of the values in the duplicate rows are the same, just use 'SELECT distinct'. If not, I would use the ROW_NUMBER() function to create a rank for each unique index, and then just choose the first rank.

I don't know what your columns are, but here's an example:

WITH subquery as
(select MessageId
ROW_NUMBER() OVER(partition by MessageID order by MessageId ASC) AS rank
)
select *
from subquery
where rank = 1
Julia
  • 1
  • 2
  • Thanks for helping. This works but it queries the distinct rows. I want to delete all duplicates from the table. The table have a lot of nested fields and is partitioned. – i_am_cris Feb 11 '20 at 07:24