51

Google BigQuery has no primary key or unique constraints.

We cannot use traditional SQL options such as INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, so how do you prevent duplicate records from being inserted into Google BigQuery?

If I have to call delete first (based on the unique key in my own system) and then insert to prevent duplicate records from being inserted into BigQuery, wouldn't that be too inefficient? I would assume that insert is the cheapest operation: no query, just append data. If I have to call delete for each insert, it will be too inefficient and cost us extra money.

What is your advice and suggestions based on your experience?

It would be nice if BigQuery had primary keys, but perhaps that would conflict with the algorithms/data structures that BigQuery is based on?

Mani Gandham
searain
  • Are you streaming? Streaming prevents duplicate records within a timeframe, based on IDs provided by you. – Felipe Hoffa Mar 23 '17 at 23:25
  • I am not familiar with streaming on BigQuery. We set up transfers in BigQuery to pull data from YouTube Reports. Sometimes the transfer did not get the report data we wanted (maybe the data was not ready yet), and we had to rerun the transfers to pull the missing reports. Google told us that even if we rerun the transfers, there will not be duplicated records. Is the BigQuery transfer using streaming? The duplicates come from our local database: I need to load data from my local MySQL database into BigQuery too, and I have to prevent duplicates at the application level. I will check out the streaming solution. – searain Mar 23 '17 at 23:39

4 Answers

49

So let's clear up some facts first.

BigQuery is a managed data warehouse suitable for large datasets, and it's complementary to a traditional database rather than a replacement for one.

Up until early 2020 there was a maximum of only 96 DML (UPDATE, DELETE) operations on a table per day. That low limit forced you to think of BQ as a data lake. The limit has since been removed, but it demonstrates that the early design of the system was oriented around append-only usage.

So, on BigQuery, you actually let all the data in and favor an append-only design. That means that, by design, you have a database that holds a new row for every update. Hence, if you want to use the latest data, you need to pick the last row and use that.
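
A minimal sketch of that "pick the last row" pattern in standard SQL, assuming a hypothetical table mydataset.users with an id key column and an updated_at timestamp written on every append (none of these names come from the question):

    -- Keep only the most recent version of each id; nothing is deleted,
    -- older versions simply get filtered out at query time.
    SELECT * EXCEPT (row_num)
    FROM (
      SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS row_num
      FROM mydataset.users
    )
    WHERE row_num = 1

BigQuery's QUALIFY clause can express the same filter without the subquery; either way you read one "latest" row per id while keeping the full append-only history.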

We actually leverage insights from every new update we add to the same row. For example, we can detect how long it took the end user to choose their country in the signup flow. Because we have a dropdown of countries, it took some time until they scrolled to the right country, and the metrics show this: we ended up with two rows in BQ, one before the country was selected and one after, and based on the selection time we were able to optimize the process. Our country dropdown now lists the 5 most recent/frequent countries first, so those users no longer need to scroll to pick a country; it's faster.

simbo1905
Pentium10
  • We run BigQuery only on daily loaded data and generate daily reports. We will keep exporting these daily reports to storage and dump them to Elasticsearch, which will be our output API, so our BigQuery data stays clean. I could control at the application layer that no duplicated data load is allowed. Also, to add an error-proof check at the BigQuery layer, I can do one DML operation before the bulk load: delete all the data of the date before I load the data for that date. Would that be good practice? – searain Mar 22 '17 at 18:24
  • What volume are we talking about? Isn't it easier if you keep everything in BQ and adjust the queries to read the last row? – Pentium10 Mar 22 '17 at 21:05
  • Right now the volume is in the millions; it could become bigger later on. There are some join queries and aggregations in BigQuery (sum/average on group by, etc.) to get our final results. For the output API, Elasticsearch will be more efficient. And we keep copies in storage purely as a backup, just in case. – searain Mar 22 '17 at 21:14
  • It makes sense to adjust the query to read the most recent version of each row; however, you need a timestamp column for that. Does it have to be explicitly defined or does it exist by default? Given that GBQ is a data lake, I would expect it to exist by default. – Giacomo Mar 01 '19 at 14:10
  • 2
    As of March 2020, DML operations no longer have quota limits in BigQuery: https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery – XstiX Sep 22 '20 at 00:09
11

"Bulk Delete and Insert" is the approach I am using to avoid the duplicated records. And Google's own "Youtube BigQuery Transfer Services" is using "Bulk Delete and Insert" too.

"Youtube BigQuery Transfer Services" push daily reports to the same set of report tables every day. Each record has a column "date".

When we run a YouTube BigQuery Transfer backfill (asking the transfer service to push the reports for certain dates again), it will first delete the full dataset for that date from the report tables, and then insert the full dataset for that date back into the report tables.

Another approach is to drop the results table first (if it already exists), then re-create the results table and re-insert the results into it. I use this approach a lot. Every day, I have my processed results saved in results tables in the daily dataset. If I rerun the process for that day, my script checks whether the results tables for that day already exist. If they do, it deletes them, re-creates fresh new tables, and then re-inserts the processed results into the newly created tables.
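
A rough sketch of the per-date "delete then reload" step in BigQuery standard SQL. The table mydataset.daily_report, its DATE-typed date column, and the staging table are assumptions for illustration; in practice the reload is often a load job from Cloud Storage rather than an INSERT ... SELECT:

    -- 1. Remove any rows already present for the date being backfilled.
    DELETE FROM mydataset.daily_report
    WHERE date = DATE '2017-03-20';

    -- 2. Re-insert the full dataset for that date, e.g. from a staging
    --    table that was just loaded from Cloud Storage.
    INSERT INTO mydataset.daily_report
    SELECT *
    FROM mydataset.daily_report_staging
    WHERE date = DATE '2017-03-20';

If the table is partitioned on that date column, the same idea applies per partition: replace the whole partition for the affected date instead of touching individual rows.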

searain
  • "Bulk Delete and Insert"- also if it applies, we should use the partition. just delete the partition copy and add new partition copy. – searain Aug 17 '20 at 16:06
  • How would you tackle `UPDATE or DELETE statement over table *** would affect rows in the streaming buffer, which is not supported` errors? – Hans Wassink Sep 10 '20 at 13:05
  • 1
    I am not using streaming now. I load the data to a Google Cloud Storage bucket and then load the data from the GCS bucket into BigQuery. That solves a lot of the problems of streaming to BigQuery. Plus, I try to use fewer updates and deletes. – searain Sep 16 '20 at 21:59
  • Ah interesting, what process/API do you use to get (JSON) data from GCS into BQ? Or do you just use the UI to do that? I'm looking for a way to automate this. – Hans Wassink Sep 17 '20 at 07:08
  • 1
    I used to use the PHP client library to load data from storage into BigQuery; of course, I also needed to use the PHP client library to load the data into storage first. A bit more work than necessary, so I was using streaming to skip some steps, but streaming had all these issues. Now we use Airflow (Google Cloud Composer); even though we cannot skip the steps of loading data to storage and then from storage into BigQuery, Airflow has all these operators you can use: https://airflow.apache.org/docs/stable/_api/airflow/contrib/operators/gcs_to_bq/index.html – searain Sep 17 '20 at 23:24
2

BigQuery no longer has DML limits.

https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery

Brandon Stewart
2

Google BigQuery now supports primary and foreign keys when creating tables. These constraints describe data integrity expectations (primary key values should be unique, and foreign key values should be present in the primary key column of the referenced table), and you can also add them to existing tables using ALTER TABLE statements. It's important to consider the trade-offs between data integrity, costs, and performance when using BigQuery as a relational database, as it's optimized for analyzing large, denormalized datasets. Additionally, BigQuery has several other recently added features, including partitioned tables, materialized views, SQL scripting, and BI Engine integration, that can improve query performance, reduce costs, and provide faster data visualizations.
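
For reference, a small sketch of how such a constraint is declared (the dataset, table, and column names are made up for illustration); note the NOT ENFORCED clause, which is why it does not block duplicate rows:

    -- Unenforced primary key: usable by the query optimizer to speed up
    -- queries, but BigQuery will still accept duplicate customer_id values.
    CREATE TABLE mydataset.customers (
      customer_id INT64 NOT NULL,
      name STRING,
      PRIMARY KEY (customer_id) NOT ENFORCED
    );

The same kind of constraint can be added to an existing table with an ALTER TABLE ... ADD PRIMARY KEY ... NOT ENFORCED statement, as mentioned above.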

I was discussing this new implementation with the Google team; they said it is not a solution to avoid duplicated keys, but it will be used to speed up queries.

I'm testing this feature and discussing with the tech team how it will work in the future.

Source: https://medium.com/codex/google-launches-primary-and-foreign-keys-for-bigquery-ffeb5b632965

Documentation: https://cloud.google.com/bigquery/docs/information-schema-constraint-column-usage

Guergolet
  • 3
    According to the Google docs, primary keys and foreign keys are for speeding up queries and are not enforced; this means you cannot prevent duplicate values from being inserted into a table just by defining the unique column as a primary key. – Akhil Mar 27 '23 at 13:07