
I have a dataset in BigQuery with 100,000+ rows and 10 columns, and I'm continuously adding new data to it. I want to fetch the rows that have not been processed yet, process them, and write them back to my table. Currently, I'm fetching them into a pandas dataframe using the BigQuery Python library and processing them with pandas.
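
Roughly, the fetch step looks like this (just a sketch; the table name and the `processed` flag column are simplified stand-ins for my actual schema):

from google.cloud import bigquery

bqclient = bigquery.Client(credentials=credentials, project=project_id)

# Pull only the rows that have not been processed yet into a DataFrame.
# `processed` is a placeholder flag column, not my real column name.
sql = "SELECT * FROM `dataset.table` WHERE processed IS NULL"
df = bqclient.query(sql).to_dataframe()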

Now, I want to update the table with the new pre-processed data. One way of doing it is with a SQL statement, calling the query function of the bigquery.Client() class, or using a job like here.

from google.cloud import bigquery

bqclient = bigquery.Client(
    credentials=credentials,
    project=project_id,
)

query = """UPDATE `dataset.table` SET field_1 = '3' WHERE field_2 = '1'"""
bqclient.query(query)

But it doesn't make sense to create an UPDATE statement for each row.

Another way I found is the to_gbq function of the pandas-gbq package. The disadvantage of this is that it rewrites the whole table.
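
For completeness, this is roughly how I call it now (a sketch; with `if_exists='replace'` the destination table is dropped and recreated, which is why every row gets rewritten, while `'append'` only adds new rows):

import pandas_gbq

# Writes the entire DataFrame to BigQuery.
# if_exists='replace' drops and recreates `dataset.table`;
# if_exists='append' would add rows instead of replacing the table.
pandas_gbq.to_gbq(
    df,
    destination_table="dataset.table",
    project_id=project_id,
    if_exists="replace",
)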

Question: What is the best way of updating a BigQuery table from a pandas dataframe?

Ilkin

1 Answer


Google BigQuery is mainly used for data analysis when your data is static and you don't have to update values, since its architecture is designed for that kind of workload. Therefore, if you want to update the data, there are some options, but they are all heavy operations:

  1. The one you mentioned: running an UPDATE query row by row.
  2. Recreating the table using only the new values.
  3. Appending the new data with a different timestamp.
  4. Using partitioned tables [1] and, if possible, clustered tables [2]. This way, when you want to update the table you can filter on the partitioning and clustering columns, so the query is less heavy. You can also append the new data to a new partition, say the current day's.

If you are using the data for analytical purposes, the best options are probably 2 and 3, but I always recommend having [1] and [2]; there is a rough sketch of that approach after the links below.

[1] https://cloud.google.com/bigquery/docs/querying-partitioned-tables

[2] https://cloud.google.com/bigquery/docs/clustered-tables
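
As a rough sketch of the append approach from options 3 and 4 (the table, column, and DataFrame names are just examples, and I'm assuming the google-cloud-bigquery client with pyarrow installed):

from google.cloud import bigquery

bqclient = bigquery.Client(project=project_id)

job_config = bigquery.LoadJobConfig(
    # Append instead of overwriting, so existing rows/partitions stay untouched.
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Partition on a DATE/TIMESTAMP column, e.g. a processing timestamp.
    time_partitioning=bigquery.TimePartitioning(field="processed_at"),
    # Optional clustering to make later filtered queries cheaper.
    clustering_fields=["field_2"],
)

load_job = bqclient.load_table_from_dataframe(
    processed_df, "project.dataset.table_processed", job_config=job_config
)
load_job.result()  # wait for the load job to finish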

  • Yes, BigQuery has update limits and it seems it is not a good idea to update row by row. Instead, I'm going to do step 2 as you said: I created a new table and will append the pre-processed inputs there. The only problem in this case is that I'm using the pandas-gbq library and its `to_gbq` function is a bit slow – Ilkin Jul 06 '20 at 11:11
  • Hi, there is [this](https://stackoverflow.com/questions/48886761/efficiently-write-a-pandas-dataframe-to-google-bigquery) other post where alternatives to ```to_gbq``` are discussed. It seems that the fastest way is to save the data into a CSV, move it to GCS and then load the data into BQ. Would this be of help for your use-case? – juferafo Jul 06 '20 at 14:17
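
A rough sketch of that CSV → GCS → load route (the bucket, object path, and table names below are placeholders):

from google.cloud import bigquery, storage

# 1. Dump the DataFrame to CSV and upload it to a GCS bucket.
storage_client = storage.Client(project=project_id)
blob = storage_client.bucket("my-staging-bucket").blob("exports/processed.csv")
blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")

# 2. Load the CSV from GCS into the BigQuery table.
bqclient = bigquery.Client(project=project_id)
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row written by to_csv
    autodetect=True,       # infer the schema from the file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = bqclient.load_table_from_uri(
    "gs://my-staging-bucket/exports/processed.csv",
    "project.dataset.table_processed",
    job_config=job_config,
)
load_job.result()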