
I'm having trouble loading huge data into BigQuery.

In GCS, I have many huge files laid out like this:

gs://bucket/many_folders/yyyy/mm/dd/many_files.gz

I want to load it to BigQuery, so first, I tried:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --max_bad_records=2100000000 \
  --nosync \
  project:dataset.table \
  gs://bucket/* \
  schema.txt

which failed because it exceeded the "max_bad_records" limit (the files are an aggregation of many types of logs, so they produce many errors).

Then I did the math and found that I need to narrow the "*" wildcard down to one folder at a time, like:

bq load --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  --max_bad_records=2100000000 \
  --nosync \
  project:dataset.table \
  gs://bucket/many_folders/yyyy/mm/dd/* \
  schema.txt

because of the max_bad_records limitation.

But I found this is very slow (because of the parallel-run limitation in BigQuery), and it also exceeds the daily load-job quota. I'd prefer to avoid this option.
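For reference, scripting that per-folder approach looks roughly like the loop below (just a sketch; the paths are the same placeholders as above). Each iteration is a separate load job, which is why it runs into the daily load-job quota:

# One load job per yyyy/mm/dd folder.
for prefix in $(gsutil ls -d gs://bucket/many_folders/*/*/*/); do
  bq load --source_format=NEWLINE_DELIMITED_JSON \
    --ignore_unknown_values \
    --max_bad_records=2100000000 \
    --nosync \
    project:dataset.table \
    "${prefix}*" \
    schema.txt
done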

Any ideas for solving this? I want to load this data as fast as possible.

Thank you for reading.

  • You may use [Dataflow](https://cloud.google.com/dataflow/) to read your data from Cloud Storage and load it into BigQuery. There are several examples of how to read from multiple paths here [(1)](https://googlecloud.tips/tips/033-reading-in-multiple-files-with-dataflow.1/), [(2)](https://stackoverflow.com/questions/48700041/how-to-match-multiple-files-with-names-using-textio-read-in-cloud-dataflow/48700475#48700475). – Yurci Jul 17 '18 at 10:29

1 Answer


I solved it by loading the GCS data as a single string column, and then parsing the data inside BigQuery as a second step.
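In case it helps, here is a rough sketch of that approach (raw_table, parsed_table and some_field are placeholder names, and the field delimiter must be a character that never appears in your logs):

# Step 1: load every line as a single STRING column. Using CSV with a
# delimiter assumed never to appear in the data, and no quote character,
# means nothing gets parsed at load time, so no rows are rejected.
bq load --source_format=CSV \
  --field_delimiter='\t' \
  --quote='' \
  project:dataset.raw_table \
  gs://bucket/* \
  raw:STRING

# Step 2: parse the raw JSON lines inside BigQuery.
bq query --use_legacy_sql=false \
  'CREATE TABLE dataset.parsed_table AS
   SELECT JSON_EXTRACT_SCALAR(raw, "$.some_field") AS some_field
   FROM dataset.raw_table'

This way the whole bucket goes in as one load job, and the error-prone parsing happens as a query afterwards.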