
I have a pipeline that reads streaming data from Cloud Pub/Sub, processes it with Dataflow, and saves it into one large BigQuery table. Each Pub/Sub message includes an associated account_id. Is there a way to create new tables on the fly whenever a new account_id is identified, and then populate them with data for that account_id?

I know this can be done by updating the pipeline for each new account, but in an ideal world Cloud Dataflow would generate these tables programmatically within the code.

Owen Monagan
  • Do I understand correctly: you don't know the names of these BigQuery tables/sinks in your pipeline until runtime, i.e. they are dynamic. Is that right? – Graham Polley Mar 27 '17 at 12:14
  • Correct, I want to create the tables at runtime in a dynamic manner. I do know their names, but we get new accounts all the time and it is not time-efficient to constantly update the pipeline for each account. – Owen Monagan Mar 27 '17 at 14:21
  • And the issue with using one large table for all accounts is: cost & speed of iterating through every account's row data. – Owen Monagan Mar 27 '17 at 16:58
  • Please also see this answer: http://stackoverflow.com/questions/43505534/writing-different-values-to-different-bigquery-tables-in-apache-beam/43505535 (a rough sketch of that approach follows below) – jkff Apr 19 '17 at 20:32
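For reference, the dynamic-destinations approach in that linked answer looks roughly like the sketch below with the Beam 2.x Java SDK. This is a minimal sketch, not the asker's actual pipeline: the topic, project, dataset, schema, and "account_id,payload" message format are all placeholder assumptions.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.ValueInSingleWindow;

public class PerAccountTables {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical two-column schema; replace with the real one.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("account_id").setType("STRING"),
        new TableFieldSchema().setName("payload").setType("STRING")));

    // Toy parsing: assumes each message is "account_id,payload".
    PCollection<TableRow> rows = p
        .apply(PubsubIO.readStrings().fromTopic("projects/my-project/topics/my-topic"))
        .apply(MapElements.into(TypeDescriptor.of(TableRow.class))
            .via(msg -> new TableRow()
                .set("account_id", msg.split(",", 2)[0])
                .set("payload", msg.split(",", 2)[1])));

    rows.apply(BigQueryIO.writeTableRows()
        // Route each row to a table named after its account_id.
        .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
          @Override
          public TableDestination apply(ValueInSingleWindow<TableRow> value) {
            String accountId = (String) value.getValue().get("account_id");
            return new TableDestination(
                "my-project:my_dataset.account_" + accountId, "Rows for one account");
          }
        })
        .withSchema(schema)
        // CREATE_IF_NEEDED makes BigQuery create each per-account table
        // the first time a row is routed to it.
        .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

With CREATE_IF_NEEDED, BigQuery creates each per-account table the first time a row is routed to it, so no pipeline update is needed when a new account appears.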

2 Answers


I wanted to share a few options I see.

Option 1 – wait for the partition-on-non-date-field feature
It is not known when this will be implemented and made available, so it might not be what you want right now. But when it goes live, it will be the best option for this kind of scenario.

Option 2 – hash your account_id into a predefined number of buckets. In this case you can pre-create all of those tables, and have logic in your code that routes each row to the respective destination table based on the account hash (a sketch follows below). The same hashing logic then needs to be used in the queries that read that data.
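A minimal sketch of that bucketing logic in plain Java, assuming a hypothetical bucket count and table naming scheme:

```java
// A sketch of Option 2: deterministic bucketing of account_id.
// NUM_BUCKETS and the table naming scheme are placeholder assumptions;
// the bucket tables must be pre-created.
public class AccountBucketing {
  static final int NUM_BUCKETS = 16; // fixed up front; changing it re-shards everything

  /** Maps an account_id to the pre-created bucket table that holds it. */
  static String tableFor(String accountId) {
    // floorMod keeps the bucket non-negative even when hashCode() is negative.
    int bucket = Math.floorMod(accountId.hashCode(), NUM_BUCKETS);
    return "my-project:my_dataset.accounts_bucket_" + bucket;
  }
}
```

One caveat: Java's String.hashCode is hard to reproduce in BigQuery SQL, so a hash available on both sides (for example FarmHash, via Guava's Hashing.farmHashFingerprint64 in the pipeline and FARM_FINGERPRINT in queries) makes it easier to keep the query-side bucketing in sync.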

Mikhail Berlyant

The API for creating BigQuery Tables is at https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert.
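For illustration, here is a minimal sketch of creating such a table from Java through the google-cloud-bigquery client library, which wraps that tables.insert endpoint; the dataset name and schema are placeholder assumptions:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

public class CreateAccountTable {
  public static void main(String[] args) {
    String accountId = args[0];
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder dataset name and schema.
    TableId tableId = TableId.of("my_dataset", "account_" + accountId);
    Schema schema = Schema.of(
        Field.of("account_id", LegacySQLTypeName.STRING),
        Field.of("payload", LegacySQLTypeName.STRING));

    // Issues a tables.insert call under the hood.
    bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
  }
}
```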

Nevertheless, it would probably be easier to store all accounts in one static table that contains account_id as a column.

Mingyu Zhong
  • While it is easier, it is also more expensive for my organization. Is there a way of achieving the above using Cloud Dataflow? – Owen Monagan Mar 27 '17 at 16:54
  • Please see Mikhail Berlyant's reply. If you want to minimize work, Partition on non-date field is the best option. – Mingyu Zhong Mar 29 '17 at 17:48