
I have hundreds of CSV files in a Google Cloud Storage bucket. I add more data daily and then reload the table using this bq command:

$ bq load --replace --project_id=my_proj-123 my_dataset_id.fav_table "gs://my_bucket/table_data/*" <my schema>

Unfortunately, some of my data has duplicates. Is there some way to load the tables but without duplicate rows?


1 Answer


There is no bq load option that prevents duplicate rows from being inserted into your BigQuery table, but you can remove the duplicates after loading:

#standardSQL
SELECT
  -- Highest number of rows sharing the same id value
  MAX(cnt)
FROM (
  SELECT
    [ID_COLUMN],
    COUNT(*) AS cnt
  FROM
    `[TABLE_NAME]`
  GROUP BY
    [ID_COLUMN])

This query returns the highest number of rows that share the same id value. You must, of course, have a unique identifier that defines what a duplicate is in your data (here, duplicates are rows with the same value in the id column).
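As a sketch, assuming your id column is called order_id (a hypothetical name), the check could be run from the shell like this:

$ bq query --use_legacy_sql=false --project_id=my_proj-123 \
    'SELECT MAX(cnt) FROM (
       SELECT order_id, COUNT(*) AS cnt
       FROM `my_dataset_id.fav_table`
       GROUP BY order_id)'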

If the query returns a value greater than 1, duplicates exist. Then, by running the following query:

#standardSQL
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    -- Number the rows within each group of identical ids
    ROW_NUMBER() OVER (PARTITION BY [ID_COLUMN]) AS row_number
  FROM
    `[TABLE_NAME]`)
WHERE
  row_number = 1

you would receive a duplicate-free version of your table. Duplicates are again identified via the id column, with a helper column, row_number, numbering the rows within each group of identical ids; only the first row of each group is kept. Note that you should write the results to a new table: the query only selects the non-duplicated data and does not alter the original table.
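As a sketch, again assuming the hypothetical order_id column, the deduplicated result could be written to a new table (fav_table_dedup below, also a hypothetical name) directly from the shell:

$ bq query --use_legacy_sql=false --project_id=my_proj-123 \
    --destination_table=my_dataset_id.fav_table_dedup --replace \
    'SELECT * EXCEPT(row_number)
     FROM (
       SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id) AS row_number
       FROM `my_dataset_id.fav_table`)
     WHERE row_number = 1'

The --replace flag overwrites the destination table on each run, which fits the daily reload workflow described in the question.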

Also, this question has a partial answer here: https://stackoverflow.com/a/43587206/3615567.
