I would like to know what is the OPTIMAL way to store the result of a Google BigQuery table query in Google Cloud Storage. My code, which currently runs in a Jupyter Notebook (in Vertex AI Workbench, in the same project as both the BigQuery data source and the Cloud Storage destination), looks as follows:
# CELL 1 OF 2
from google.cloud import bigquery
bqclient = bigquery.Client()
# The query string can vary:
query_string = """
SELECT *
FROM `my_project-name.my_db.my_table`
LIMIT 2000000
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        create_bqstorage_client=True,
    )
)
print("Dataframe shape: ", dataframe.shape)
# CELL 2 OF 2:
import pandas as pd
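# Note: writing straight to a gs:// path like this relies on gcsfs/fsspec being installed.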
dataframe.to_csv('gs://my_bucket/test_file.csv', index=False)
This code takes around 7.5 minutes to successfully complete.
Is there a more OPTIMAL way to achieve what was done above? (That would primarily mean faster, but maybe something else could be improved as well.)
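One direction I have been considering, but have not tested at this scale, is to skip the pandas round-trip entirely and export the query result straight from BigQuery to Cloud Storage with an extract job. This is only a sketch: the sharded file name pattern and the location="US" value below are assumptions on my part and would need to match the actual dataset.
# POSSIBLE ALTERNATIVE (untested sketch): export the query result directly to GCS.
from google.cloud import bigquery
bqclient = bigquery.Client()
query_job = bqclient.query(query_string)  # same query string as in CELL 1 OF 2
query_job.result()                        # wait for the query to finish
# The results land in a temporary (anonymous) table, referenced by .destination;
# extract_table() can export that table to (possibly sharded) CSV files in GCS.
extract_job = bqclient.extract_table(
    query_job.destination,
    "gs://my_bucket/test_file_*.csv",  # wildcard lets BigQuery split large outputs
    location="US",  # assumption: must match the dataset's location
)
extract_job.result()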
Some additional notes:
- I want to run it "via a Jupyter Notebook" (in Vertex AI Workbench), because sometimes some data preprocessing or special filtering must be done that cannot easily be accomplished via SQL queries.
- For the first part of the code, I have discarded pandas.read_gbq, as it was giving me some weird EOF errors when (experimentally) "storing as .CSV and reading back".
- Intuitively, I would focus the optimization efforts on the second half of the code (CELL 2 OF 2), as the first one was borrowed from the official Google documentation. I have tried this but it does not work; however, in the same thread this option worked OK.
- It is likely that this code will be included in some Docker image afterwards, so "as few libraries as possible" should be used (see the minimal-dependency sketch after these notes).
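If I stick with pandas for the preprocessing, one way to keep the dependency footprint small might be to upload with the google-cloud-storage client instead of relying on gcsfs behind pandas' gs:// support. This is just a sketch and assumes the serialized CSV fits comfortably in memory as a single string:
# MINIMAL-DEPENDENCY SKETCH for CELL 2 OF 2 (assumes the CSV fits in memory).
from google.cloud import storage
storage_client = storage.Client()
blob = storage_client.bucket("my_bucket").blob("test_file.csv")
# Serialize the dataframe once and upload it as a single object.
blob.upload_from_string(dataframe.to_csv(index=False), content_type="text/csv")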
Thank you.