You cannot prevent inserting duplicates in your BigQuery table by using the bq
command. But you can manually remove the duplicates after insertion:
#standardSQL
SELECT
MAX(count) FROM(
SELECT
[ID_COLUMN],
count(*) as count
FROM
`[TABLE_NAME]`
GROUP BY
[ID_COLUMN])
This query returns the highest number of duplicate entries, based on the id column. You, of course, have to have a unique identifier that would define what a duplicate is in the dataset (in this case duplicates are rows with the same value of the id column).
If the return of the query is >1, it means that duplicates exist. Then, by running the following query:
#standardSQL
SELECT
* EXCEPT(row_number)
FROM (
SELECT
*,
ROW_NUMBER()
OVER (PARTITION BY [ID_COLUMN]) row_number
FROM
`[TABLE_NAME]`)
WHERE
row_number = 1
you would receive a duplicate-free version of your table. Duplicates are recognised and unaccounted, again, via the id column and a helper column, row_number
. Note that it is then suggested to write the results in a new table (there is no altering of the original table, just a select of non-duplicated data).
Also, this question has a partial answer here:
https://stackoverflow.com/a/43587206/3615567.