BigQuery supports insertIds and performs best-effort de-duplication to help with failure scenarios when inserting data via its API. According to the documentation, BigQuery remembers an insert for up to a minute, so if an insert fails, it can be retried via the API without worrying about duplicating data that was already inserted. This can be tricky to get right.
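For illustration, here is roughly what that retry pattern looks like with the Python client library (google-cloud-bigquery); the table name and row contents are made up:

```python
import uuid
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # hypothetical table

rows = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view"},
]

# Generate one insertId per row *once*, outside the retry loop.
# Reusing the same ids on a retry is what lets BigQuery's best-effort
# de-duplication drop rows it already accepted.
row_ids = [str(uuid.uuid4()) for _ in rows]

for attempt in range(3):
    errors = client.insert_rows_json(table_id, rows, row_ids=row_ids)
    if not errors:
        break  # all rows accepted
    # On failure, retry with the same row_ids while still inside
    # BigQuery's de-duplication window.
```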

The problem is that on Google Cloud there are many services that promise to insert data into BigQuery. For example, Dataflow / Apache Beam is part of the recommended stack for getting data from many sources into BigQuery. There are also Dataprep, Stackdriver Logging, and others.

So is there a single, consistent way to recover from failed inserts in BigQuery when using an arbitrary third-party BigQuery client, i.e. not calling the BigQuery API directly?

1 Answer

No.

Different BigQuery clients use BigQuery APIs in different ways. This means that various Google Cloud services that offer export (or streaming) of data into BigQuery (e.g. Dataprep, Dataflow) have different strategies for dealing with failed BigQuery inserts.

If you need a consistent approach for BigQuery data de-duplication in case of failed inserts, you need to implement your own BigQuery API client application.
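As one possible sketch of such a client in Python (table name and row fields here are hypothetical), you can derive each insertId deterministically from the row's content, so any retry, even from a different worker, resends the same id:

```python
import hashlib
import json
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.events"  # hypothetical table

def insert_id(row: dict) -> str:
    # Hash a canonical JSON encoding of the row so the same logical row
    # always maps to the same insertId.
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

def insert_with_dedup(rows: list[dict]) -> list:
    row_ids = [insert_id(r) for r in rows]
    # Returns a list of per-row errors; an empty list means all rows were accepted.
    return client.insert_rows_json(table_id, rows, row_ids=row_ids)
```

Keep in mind that BigQuery's de-duplication is best effort and only covers a short window, so this protects against retried inserts, not against re-running an entire job much later.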
