I would like to know what is the OPTIMAL way to store the result of a Google BigQuery table query in Google Cloud Storage. My code, which currently runs in a Jupyter Notebook (in Vertex AI Workbench, in the same project as both the BigQuery data source and the Cloud Storage destination), looks as follows:
# CELL 1 OF 2
from google.cloud import bigquery
bqclient = bigquery.Client()
# The query string can vary:
query_string = """
SELECT *
FROM `my_project-name.my_db.my_table`
LIMIT 2000000
"""
dataframe = (
    bqclient.query(query_string)
    .result()
    .to_dataframe(
        create_bqstorage_client=True,
    )
)
print("Dataframe shape: ", dataframe.shape)
# CELL 2 OF 2:
import pandas as pd
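# Note: writing straight to a gs:// path like this relies on gcsfs/fsspec being installed.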
dataframe.to_csv('gs://my_bucket/test_file.csv', index=False)
This code takes around 7.5 minutes to successfully complete.
Is there a more OPTIMAL way to achieve what was done above? (That would primarily mean faster, but maybe something else could be improved as well.)
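One direction I have been considering, but have not tested at this scale, is to skip the pandas round-trip entirely and export the query result straight from BigQuery to Cloud Storage with an extract job. This is only a sketch: the sharded file name pattern and the location="US" value below are assumptions on my part and would need to match the actual dataset.
# POSSIBLE ALTERNATIVE (untested sketch): export the query result directly to GCS.
from google.cloud import bigquery
bqclient = bigquery.Client()
query_job = bqclient.query(query_string)  # same query string as in CELL 1 OF 2
query_job.result()                        # wait for the query to finish
# The results land in a temporary (anonymous) table, referenced by .destination;
# extract_table() can export that table to (possibly sharded) CSV files in GCS.
extract_job = bqclient.extract_table(
    query_job.destination,
    "gs://my_bucket/test_file_*.csv",  # wildcard lets BigQuery split large outputs
    location="US",  # assumption: must match the dataset's location
)
extract_job.result()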
Some additional notes:
- I want to run it "via a Jupyter Notebook" (in Vertex AI Workbench), because sometimes some data preprocessing or special filtering must be done that cannot easily be accomplished via SQL queries.
- For the first part of the code, I have discarded pandas.read_gbq, as it was giving me some weird EOF errors when (experimentally) "storing as .CSV and reading back".
- Intuitively, I would focus the optimization efforts on the second half of the code (CELL 2 OF 2), as the first one was borrowed from the official Google documentation. I have tried this but it does not work; however, in the same thread this option worked OK.
- It is likely that this code will be included in some Docker image afterwards, so "as few libraries as possible" should be used (see the minimal-dependency sketch after these notes).
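If I stick with pandas for the preprocessing, one way to keep the dependency footprint small might be to upload with the google-cloud-storage client instead of relying on gcsfs behind pandas' gs:// support. This is just a sketch and assumes the serialized CSV fits comfortably in memory as a single string:
# MINIMAL-DEPENDENCY SKETCH for CELL 2 OF 2 (assumes the CSV fits in memory).
from google.cloud import storage
storage_client = storage.Client()
blob = storage_client.bucket("my_bucket").blob("test_file.csv")
# Serialize the dataframe once and upload it as a single object.
blob.upload_from_string(dataframe.to_csv(index=False), content_type="text/csv")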
Thank you.