
I have hundreds of CSV files in a Google Cloud Storage bucket. I add more data daily and then reload the table using this bq command:

$ bq load --replace --project_id=my_proj-123 my_dataset_id.fav_table "gs://my_bucket/table_data/*" <my schema>

Unfortunately, some of my data has duplicates. Is there some way to load the tables but without duplicate rows?


1 Answer


There is no bq load option that prevents duplicate rows from being inserted into your BigQuery table, but you can remove the duplicates after loading:

#standardSQL
SELECT
  -- Highest number of rows sharing the same id value
  MAX(cnt)
FROM (
  SELECT
    [ID_COLUMN],
    COUNT(*) AS cnt
  FROM
    `[TABLE_NAME]`
  GROUP BY
    [ID_COLUMN])

This query returns the highest number of rows that share the same id value. You must, of course, have a unique identifier that defines what a duplicate is in your data (here, duplicates are rows with the same value in the id column).
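As a sketch, assuming your id column is called order_id (a hypothetical name), the check could be run from the shell like this:

$ bq query --use_legacy_sql=false --project_id=my_proj-123 \
    'SELECT MAX(cnt) FROM (
       SELECT order_id, COUNT(*) AS cnt
       FROM `my_dataset_id.fav_table`
       GROUP BY order_id)'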

If the query returns a value greater than 1, duplicates exist. Then, by running the following query:

#standardSQL
SELECT
  * EXCEPT(row_number)
FROM (
  SELECT
    *,
    -- Number the rows within each group of identical ids
    ROW_NUMBER() OVER (PARTITION BY [ID_COLUMN]) AS row_number
  FROM
    `[TABLE_NAME]`)
WHERE
  row_number = 1

you would receive a duplicate-free version of your table. Duplicates are again identified via the id column, with a helper column, row_number, numbering the rows within each group of identical ids; only the first row of each group is kept. Note that you should write the results to a new table: the query only selects the non-duplicated data and does not alter the original table.
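As a sketch, again assuming the hypothetical order_id column, the deduplicated result could be written to a new table (fav_table_dedup below, also a hypothetical name) directly from the shell:

$ bq query --use_legacy_sql=false --project_id=my_proj-123 \
    --destination_table=my_dataset_id.fav_table_dedup --replace \
    'SELECT * EXCEPT(row_number)
     FROM (
       SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id) AS row_number
       FROM `my_dataset_id.fav_table`)
     WHERE row_number = 1'

The --replace flag overwrites the destination table on each run, which fits the daily reload workflow described in the question.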

Also, this question has a partial answer here: https://stackoverflow.com/a/43587206/3615567.
