We have a table in BigQuery that we need to export to a local newline-delimited JSON file. Using BigQuery's export-to-GCS functionality is problematic because it converts integer types to strings (see "export bigquery data to cloud storage, the integer field changs to string format but float format stays as numeric format" and "How to preserve integer data type when exporting to JSON?"), and I've confirmed this myself: the ints are lost. We've come up with the following solution, which preserves integer types but is very slow:
Currently working code
import json
from google.cloud import bigquery

bq = bigquery.Client()
our_query = "select * from our_project.our_dataset.our_bq_table"
results_row_iter = bq.query(our_query)  # QueryJob; iterating it fetches the result rows

with open('/tmp/output_file.json', 'w') as f:
    for row in results_row_iter:
        f.write(json.dumps(dict(row), default=str) + '\n')  # dump each row as ndjson
our_bq_table is 5 GB in BigQuery, with 3.4M rows and ~100 fields, and the above for loop takes 90 minutes on our table. our_bq_table is partitioned on the integer column confId, and there are ~100 unique confIds in the table, with values 1 - 100. We would like to leverage the partition key plus parallelization to speed up this process... somehow.
Pseudo-code of what we're going for
bq = bigquery.Client()
base_query = "select * from our_project.our_dataset.our_bq_table"
all_conf_ids = range(1, 101)  # confId values are 1 - 100

def dump_conf_id(conf_id):
    iter_query = f"{base_query} where confId = {conf_id}"
    results_row_iter = bq.query(iter_query)
    with open(f'output_file-{conf_id}.json', 'w') as f:
        for row in results_row_iter:
            f.write(json.dumps(dict(row), default=str) + '\n')  # dumps as ndjson

in parallel:
    for conf_id in all_conf_ids:
        dump_conf_id(conf_id)

# last step, perhaps concat the separate files into 1 somehow, assuming there are multiple output files...
This approach leverages the confId field so that each individual BigQuery query stays small. I'm not quite sure how to implement this beyond the pseudo-code, and I'm overwhelmed trying to choose between multithreading, multiprocessing, and the other ways to parallelize in Python. Our final output needs to be a single file; the pseudo-code dumps into separate files, but if we can dump into a single file in parallel, that would be great too.
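For reference, here is a rough, untested sketch of how we imagine the per-confId version could look using concurrent.futures (the thread pool, the max_workers=8 setting, the /tmp file names, and the final concatenation step are all just placeholders, not a settled design):

import json
import shutil
from concurrent.futures import ThreadPoolExecutor
from google.cloud import bigquery

bq = bigquery.Client()
base_query = "select * from our_project.our_dataset.our_bq_table"
all_conf_ids = range(1, 101)

def dump_conf_id(conf_id):
    # query a single partition and write it to its own ndjson file
    rows = bq.query(f"{base_query} where confId = {conf_id}")
    path = f"/tmp/output_file-{conf_id}.json"
    with open(path, 'w') as f:
        for row in rows:
            f.write(json.dumps(dict(row), default=str) + '\n')
    return path

# run one dump per confId on a pool of worker threads
with ThreadPoolExecutor(max_workers=8) as pool:
    part_files = list(pool.map(dump_conf_id, all_conf_ids))

# concatenate the per-partition files into the single required output file
with open('/tmp/output_file.json', 'wb') as out:
    for part in part_files:
        with open(part, 'rb') as src:
            shutil.copyfileobj(src, out)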
Edit: a key question we're trying to resolve before implementing a solution is: should we use multiprocessing or multithreading for this, given that we're dumping to local .json files in parallel?
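As far as we can tell, the two candidates only differ in which executor the same (hypothetical) dump_conf_id function from the sketch above gets handed to, which is why we're stuck on the choice rather than the code:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

# Option A: threads -- all workers share one process, so they can share the bigquery.Client
with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(dump_conf_id, all_conf_ids)

# Option B: processes -- each worker is its own process; dump_conf_id would then likely need
# to create its own bigquery.Client inside the worker, since clients generally don't transfer
# cleanly between processes
with ProcessPoolExecutor(max_workers=8) as pool:
    pool.map(dump_conf_id, all_conf_ids)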