Currently I have exported the data as sharded files to Google Cloud Storage, downloaded them to a server, and I am streaming them into the partitioned table. The problem is that this takes a long time: it streams roughly 1 GB in 40 minutes. Please help me make it faster. My machine has 12 cores and 20 GB of RAM.
-
Why don't you load the data from Cloud Storage instead of using streaming? – Elliott Brossard Jul 07 '17 at 11:39
-
I want to have all this data in BigQuery again as partitioned, so I can query it. – vidhya sagar Jul 07 '17 at 12:23
-
@ElliottBrossard Is it possible to export the data as partitioned to Google Cloud Storage? Because right now the data is not exported in any order. – vidhya sagar Jul 07 '17 at 12:42
-
Put the data in separate directories, then see Pentium10's answer. – Elliott Brossard Jul 07 '17 at 12:47
-
I have 6 months of data to export. If I start exporting to different directories, it will take longer, because for each date I have to run a separate query to create a new table and then export it by hand. If I am wrong, please correct me. – vidhya sagar Jul 07 '17 at 12:58
-
https://stackoverflow.com/questions/38993877/migrating-from-non-partitioned-to-partitioned-tables/39001706#39001706 I have tried this solution too, but it doesn't work. @ElliottBrossard – vidhya sagar Jul 07 '17 at 13:38
-
This last linked solution is another one that works great. – Pentium10 Jul 07 '17 at 17:02
-
@Pentium10 I get this error: "Cannot query rows larger than 100MB limit." How can I solve this problem? – vidhya sagar Jul 10 '17 at 12:31
-
post another question with a failed job_id – Pentium10 Jul 10 '17 at 13:00
1 Answer
You can load data from Google Cloud Storage directly into a partition using an API call or other methods.
To update data in a specific partition, append a partition decorator to the name of the partitioned table when loading data into the table. A partition decorator represents a specific date and takes the form:
$YYYYMMDD
For example, the following command replaces the data in the entire partition for the date January 1, 2016 (20160101) in a partitioned table named mydataset.table1 with content loaded from a Cloud Storage bucket:
bq load --replace --source_format=NEWLINE_DELIMITED_JSON 'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json
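If the export produces one file per day (relevant to the comment thread about six months of data), the loads can be scripted over a date range instead of being run by hand for each date. A minimal sketch, assuming GNU date and daily files named gs://[MY_BUCKET]/YYYYMMDD.json; the file layout and date range are assumptions, not part of the original answer:

# Load each day's file into its matching partition (hypothetical date range).
for i in $(seq 0 180); do
  day=$(date -d "2016-01-01 +${i} days" +%Y%m%d)
  bq load --replace --source_format=NEWLINE_DELIMITED_JSON \
    "mydataset.table1\$${day}" "gs://[MY_BUCKET]/${day}.json"
done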
Note: Because partitions in a partitioned table share the table schema, replacing data in a partition will not replace the schema of the table. Instead, the schema of the new data must be compatible with the table schema. To update the schema of the table with the load job, use configuration.load.schemaUpdateOptions.
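The bq tool exposes the same schema update option as a flag. A sketch, assuming the updated schema (including any added columns) lives in a local file myschema.json, which is a hypothetical name:

# Overwrite the partition while allowing new columns to be added to the table schema.
bq load --replace --schema_update_option=ALLOW_FIELD_ADDITION \
  --source_format=NEWLINE_DELIMITED_JSON \
  'mydataset.table1$20160101' gs://[MY_BUCKET]/replacement_json.json ./myschema.json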
Read more: https://cloud.google.com/bigquery/docs/creating-partitioned-tables
