
I am trying to explore BigQuery's ability to load CSV files (DoubleClick impression data) into BigQuery's partitioned tables. My use case includes:

1. Reading daily (nightly load) dumps (CSV) from Google Cloud Storage for my customer's (an ad agency) 30 different clients into BQ. A daily dump may contain data from the previous day/week. All data should be loaded into the respective daily partition (in BQ) so as to provide daily reporting to individual clients.
2. The purpose here is to build an analytical system that gives the ad agency the ability to run "trends & patterns over time and across clients".

I am new to BQ and thus trying to understand its schema layout. Should I create a single table with daily partitions (holding data from all 50 clients / 50 daily load CSV files)? Do the partitions need to be created well in advance? Or should I create 50 different tables (each partitioned by date), one per client, so as NOT to run into any data sharing/security concerns of the single-table option? My customer wants a simple solution with minimal cost.
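
A minimal sketch of the single-table option, assuming the Python google-cloud-bigquery client and placeholder project, dataset, and column names: with ingestion-time DAY partitioning, partitions do not need to be created in advance, and a $YYYYMMDD partition decorator on the load destination routes a file into a specific day.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # hypothetical project id
    table_id = "my-project.ads.impressions"          # hypothetical dataset/table

    # One day-partitioned table for all clients; partitions are created
    # automatically as data arrives, not in advance.
    schema = [
        bigquery.SchemaField("client_id", "STRING"),           # hypothetical columns
        bigquery.SchemaField("impression_time", "TIMESTAMP"),
        bigquery.SchemaField("placement_id", "STRING"),
    ]
    table = bigquery.Table(table_id, schema=schema)
    table.time_partitioning = bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY
    )
    client.create_table(table, exists_ok=True)

    # Load one nightly CSV from GCS straight into a specific daily partition
    # by appending a $YYYYMMDD partition decorator to the destination.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    client.load_table_from_uri(
        "gs://my-bucket/nightly/client_a_2017-05-07.csv",   # hypothetical path
        table_id + "$20170507",
        job_config=job_config,
    ).result()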

Bhavik Desai
  • Why not simply use the new transfer service? It automatically loads all your Doubleclick data into partitioned tables in BigQuery. You don't need to hand-roll your own pipelines anymore (unless you've got some requirements that force you to) - https://cloud.google.com/bigquery/docs/doubleclick-publisher-transfer – Graham Polley May 08 '17 at 00:09
  • Thanks Graham. Appreciate your reply. I thought of the Data Transfer Service. I also thought BQ's 'federated sources' option would take care of cost (compared with the transfer service) because you don't stage data into BQ; it gives the ability to query directly from GCS. – Bhavik Desai May 08 '17 at 00:23
  • A quick question here: is there any way to get a sample file (impression file) from DoubleClick? I am trying to set up the Data Transfer Service for my trial account with BQ. – Bhavik Desai May 08 '17 at 00:24
  • But you still have to pay for the queries and for storage in GCS when using federated sources, and you take a hit on performance (federated sources will never be as quick as native tables). Federated sources are designed more for smaller mapping/reference-like tables that you can join on to normalise your data (see the external-table sketch after these comments). With regard to sample DoubleClick data, see here -> http://stackoverflow.com/questions/42190253/looking-for-a-sample-doubleclick-ad-serve-logs-for-bigquery-analysis – Graham Polley May 08 '17 at 02:22
  • Got it. Many thanks, Graham. Just to be on the same page with you: you used Google Dataflow for data ingestion (and ETL) and not the Data Transfer Service, am I right? I think my use case is pretty simple. I have a one-time extract (a nightly batch file for each client with ~25GB of impression data in each CSV file) to load into BQ. I think you already recommended the Data Transfer Service for that. – Bhavik Desai May 08 '17 at 06:48
  • Graham, could you point me to any documentation on Data Transfer Service setup? Somehow I could not find the impression file (DoubleClick sample data) in your links. Could you also validate what kind of BQ schema I should have? I mean, should I create 50 tables (one table per client/advertiser) or just a single table for all 50 clients? For loading late-arriving data (e.g. Monday's data arriving in Friday's log), should I create a job that reads records one by one and loads them into the respective daily partition? Or is there a better way of making sure that ingestion goes into the correct daily partition? – Bhavik Desai May 08 '17 at 06:50
  • Graham, it looks like you created a sample impression dataset yourself using the schema given at https://support.google.com/dfp_premium/answer/1733128?hl=en. Would it be fair to ask if you could share that sample with me? – Bhavik Desai May 08 '17 at 07:04
  • Graham: I want to try the Dataflow option you mentioned. I believe it works like this: (1) load the entire CSV file (containing multi-date data) into a non-partitioned BigQuery table in one go, (2) create a Dataflow pipeline to read the data row by row, and (3) load the rows into the respective pre-created partitions. Am I right? Is there any cost involved in doing bq.io.read/io.write? – Bhavik Desai May 10 '17 at 21:46
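
On the federated sources mentioned in the comments above, a minimal sketch, assuming the Python google-cloud-bigquery client and placeholder bucket/dataset names: an external table is defined over the CSVs in GCS and queried in place, with no load step, though query costs and GCS storage still apply and performance trails native tables.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")   # hypothetical project id

    # Federated (external) table that reads the nightly CSVs in place from GCS.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-bucket/nightly/*.csv"]  # hypothetical path
    external_config.autodetect = True                # or supply an explicit schema
    external_config.options.skip_leading_rows = 1

    table = bigquery.Table("my-project.ads.impressions_external")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # Each query scans the CSVs in GCS at query time.
    query = """
        SELECT COUNT(*) AS impressions
        FROM `my-project.ads.impressions_external`
    """
    print(list(client.query(query).result()))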

1 Answer


If you are going to use the transfer service (as mentioned in the comments), you don't need to create tables by hand; the transfer service will do that for you. It will schedule daily jobs and load the data into the correct daily partitions. Also, if there is a short delay (2-3 days), the transfer service will still pick up the data.

xuejian
  • Thanks Xuejian. Yes, I recognized this advantage. One of my concerns is loading multi-date data into the right partitions. It looks like I have to use Dataflow to process the data row by row, and I am not sure how to do that as I am new to Dataflow. – Bhavik Desai May 10 '17 at 17:08
  • Is it like this: read the entire CSV file (with multiple dates of data in it) and load it into a BigQuery table (let's say a STAGING table) in one go, and then use Dataflow to do row-by-row processing from that staging table and load the data into another table partitioned by date? – Bhavik Desai May 10 '17 at 17:10
  • I thought this option would not only incur high cost (reading data row by row from a 12TB staging table) but also introduce an extra step of loading the right data into the right partition. Such an extra step would delay report availability for the business. Am I wrong in my understanding? – Bhavik Desai May 10 '17 at 17:12
  • Is there any way I can read row-by-row data from the source (CSV file on GCS) and load it into a pre-created partitioned table directly? – Bhavik Desai May 10 '17 at 17:13
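
One possible way to approach that last question, sketched under assumptions (Python google-cloud-bigquery and google-cloud-storage clients, a hypothetical GCS path, and the event date assumed to be the second CSV column): rather than streaming rows one by one, read the multi-date CSV from GCS, group the rows by date, and run one load job per date against the matching $YYYYMMDD partition decorator, so late-arriving rows still land in the correct daily partition.

    import csv
    import io
    from collections import defaultdict

    from google.cloud import bigquery, storage

    bq = bigquery.Client(project="my-project")        # hypothetical project id
    gcs = storage.Client(project="my-project")
    table_id = "my-project.ads.impressions"           # hypothetical day-partitioned table

    # Read one nightly multi-date CSV from GCS (hypothetical path).
    blob = gcs.bucket("my-bucket").blob("nightly/client_a.csv")
    rows = csv.reader(io.StringIO(blob.download_as_text()))
    next(rows)                                        # skip the header row

    # Group rows by the day they belong to (date assumed in column 2, YYYY-MM-DD).
    by_day = defaultdict(list)
    for row in rows:
        day = row[1][:10].replace("-", "")            # -> YYYYMMDD
        by_day[day].append(row)

    # One load job per day, each targeting the matching partition decorator,
    # so Monday rows arriving in Friday's file still land in Monday's partition.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    for day, day_rows in by_day.items():
        buf = io.StringIO()
        csv.writer(buf).writerows(day_rows)
        bq.load_table_from_file(
            io.BytesIO(buf.getvalue().encode("utf-8")),
            "{}${}".format(table_id, day),
            job_config=job_config,
        ).result()

For very large files, a Dataflow pipeline doing the same grouping would scale better; the script above is the small-scale version of the same idea.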