I am trying to find the best way to deduplicate my data set, and am having trouble finding a solution I can implement. To explain: I currently have a GCS bucket that I upload a .csv file into daily. Each of these .csv files contains the data for the previous 5 days. Because so many days overlap, I end up with quite a few duplicate rows in my table in BigQuery. (Note: I have to upload this way because, due to delays in our systems, records sometimes don't get added for a day or two, and the date they show is the date the transaction took place, not the date it was added.)
When the duplicates are added, they are completely identical across all 30 columns in the data set. What do I need to do to remove the duplicates? The first column is a "Transaction ID" column, which is a distinct ID per transaction. I assume I can use this, I'm just not sure how.
I tried the below, and it removed all the null values, but not the duplicates that have actual information in the rows.
CREATE OR REPLACE TABLE `project.dataset.tableDeDuped` AS
SELECT DISTINCT *
FROM `project.dataset.table`;
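In case it helps, this is the direction I was thinking of for using the Transaction ID, though I haven't been able to confirm it works for my case (transaction_id is a placeholder for our actual column name):

-- Keep only the first row per Transaction ID; QUALIFY filters
-- on the window function result without needing a subquery.
CREATE OR REPLACE TABLE `project.dataset.tableDeDuped` AS
SELECT *
FROM `project.dataset.table`
QUALIFY ROW_NUMBER() OVER (PARTITION BY transaction_id) = 1;

Is something like this the right approach, or is there a better way?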