5

I have a table in Google BigQuery that I access and modify in Python using the pandas functions read_gbq and to_gbq. The problem is that appending 100,000 rows takes about 150 seconds, while appending 1 row takes about 40 seconds. I would like to update a value in the table rather than append a row. Is there a way to update a value in the table using Python that is very fast, or at least faster than 40 seconds?
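For context, the append path described above looks roughly like this (a minimal sketch, assuming a hypothetical DataFrame `df`, project ID, and `dataset.table` name):

import pandas as pd

# Hypothetical DataFrame holding the rows to append
df = pd.DataFrame({"field_1": ["3"], "field_2": ["1"]})

# Each to_gbq call issues a separate BigQuery job, so per-job overhead
# dominates for small appends, which is why even a single row is slow.
df.to_gbq("dataset.table", project_id="my-project", if_exists="append")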

user1367204
  • 4,549
  • 10
  • 49
  • 78

2 Answers

6

Not sure if you can do so using pandas, but you certainly can using the google-cloud library.

You could just install it (`pip install --upgrade google-cloud`) and run something like:

import uuid
import os

# Point the client at your service-account credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'path_to_json_credentials.json'
from google.cloud.bigquery.client import Client

bq_client = Client()

# run_async_query comes from the older google-cloud-bigquery client API;
# newer releases replace it with client.query() (see the other answer).
job_id = str(uuid.uuid4())
query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
job = bq_client.run_async_query(query=query, job_name=job_id)
job.use_legacy_sql = False  # DML requires standard SQL
job.begin()  # submits the job asynchronously

In my tests, this operation takes about 2 seconds on average.

As a side note, it's important to keep in mind the quotas related to DML operations in BQ, that is, know when it's appropriate to use them and if they fit your needs well.

Willian Fuks
  • 11,259
  • 10
  • 50
  • 74
  • Do you think I am interpreting [this](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-manipulation-language#quotas) correctly in saying that I can only update 96 values per day in a single table? So if I have 8 columns and 100,000,000 rows, I can only update 96 of those 800,000,000 values per day? – user1367204 Jul 10 '17 at 13:18
  • 1
    You can run 96 UPDATE operations per day per table. You can UPDATE any number of rows given your `WHERE` clause (for instance, `where user_id=2` will take all rows with user_id 2 and update them, even if it selects thousands of rows); a batched example along these lines is sketched after these comments. – Willian Fuks Jul 10 '17 at 13:22
  • Thanks for helping me with that. In my specific case, I will have a separate table where each row represents a job to be executed, and there will be about 250,000 jobs (250,000 rows). One of the fields in each row is "datetime_of_last_update", which I wanted to `update` when the job is successfully executed by `update`-ing the value to the datetime that the job finished. Would you agree that, due to the 96 UPDATE operations per day, I would not be able to update the datetimes one by one, because 250,000 jobs > the 96 quota? – user1367204 Jul 10 '17 at 13:36
  • Yeah, I agree. BigQuery is not designed to handle data that is constantly updated; it's more of a DW solution. Maybe you could run some tests with [datastore](https://cloud.google.com/datastore/) and see if it works well for you in this case (a minimal sketch of that approach also follows these comments). – Willian Fuks Jul 10 '17 at 16:48
  • Starting from March 2020, DML without limits, now in BigQuery - https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery – Idhem Oct 06 '21 at 16:00
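To make the batching point above concrete, here is a minimal sketch using the current client.query API; the `dataset.jobs` table and `job_id` column are hypothetical, chosen to mirror the jobs-table example. A single parameterized UPDATE stamps many finished jobs at once, so it counts as one DML statement rather than one per row:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured

finished_job_ids = [101, 102, 103]  # hypothetical ids of jobs that just completed

query = """
UPDATE `dataset.jobs`
SET datetime_of_last_update = CURRENT_TIMESTAMP()
WHERE job_id IN UNNEST(@job_ids)
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("job_ids", "INT64", finished_job_ids)
    ]
)

client.query(query, job_config=job_config).result()  # one UPDATE, many rows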
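And for the Datastore route suggested in the comments, a minimal sketch; the `Job` kind, entity key, and property name are hypothetical, again mirroring the jobs-table example:

from datetime import datetime, timezone
from google.cloud import datastore

client = datastore.Client()  # assumes default credentials are configured

# Hypothetical job entity, keyed by the job's numeric id
key = client.key("Job", 12345)
entity = client.get(key) or datastore.Entity(key=key)

# Per-entity writes like this are cheap and not subject to BigQuery's DML quotas
entity["datetime_of_last_update"] = datetime.now(timezone.utc)
client.put(entity)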
0

With the updated google-cloud library, you can now do as follows:

from google.cloud import bigquery

# `credentials` and `project` are assumed to be defined earlier
# (e.g. via google.oauth2.service_account); bigquery.Client() alone
# also works if default credentials are configured.
client = bigquery.Client(credentials=credentials, project=project)

query = """
UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'
"""

job_config = bigquery.QueryJobConfig(use_legacy_sql=False)

query_job = client.query(query, job_config=job_config)
query_job.result()  # wait for the DML job to finish

print("Updated table")
print(query_job.num_dml_affected_rows, "row(s) updated")
Dr.Bit
  • 719
  • 1
  • 6
  • 14