
Using DataFlow SDK 2.x, I would like to read data from a partitioned BigQuery table.

The only option I have found so far is using the `BigQueryIO.read().fromQuery(String query)` method with a query like `SELECT * FROM table WHERE _PARTITIONTIME = TIMESTAMP('yyyy-MM-dd')`.

Is there any other way to read from partitioned BigQuery table in DataFlow?
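To make the query approach concrete, here is a minimal sketch that builds a partition-filtered query string and shows (in a comment) where it would be handed to `BigQueryIO`. The table name and the `forDay` helper are illustrative, not from the original post:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionQuery {
    // Build a query that restricts the scan to a single ingestion-time
    // partition via the _PARTITIONTIME pseudo-column (standard SQL).
    static String forDay(String table, LocalDate day) {
        String date = day.format(DateTimeFormatter.ISO_LOCAL_DATE); // yyyy-MM-dd
        return "SELECT * FROM `" + table + "` WHERE _PARTITIONTIME = TIMESTAMP('" + date + "')";
    }

    public static void main(String[] args) {
        String query = forDay("myproject.mydataset.mytable", LocalDate.of(2018, 3, 20));
        System.out.println(query);
        // In a pipeline this would be passed to the source, e.g.:
        //   p.apply(BigQueryIO.readTableRows().fromQuery(query).usingStandardSql());
    }
}
```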

bleare
    This is the correct way. If you use a normal read, it will read the entire table into your pipeline, regardless of whether it's a partitioned table or not. – Graham Polley Mar 20 '18 at 10:09
  • @GrahamPolley, thanks for the answer. Do you plan adding method-level support for querying partitioned tables? – bleare Mar 20 '18 at 10:18
  • 2
    You can also use a partition decorator, read from "table$yyyyMMdd" should work I believe. There are currently no plans on adding more explicit support as far as I'm aware; if you need it, feel free to file a JIRA at https://issues.apache.org/jira/browse/BEAM or post on the user@ mailing list. – jkff Mar 20 '18 at 15:35
  • @jkff Is this behavior documented somewhere? – pradithya aria Nov 02 '18 at 07:31
  • It is documented at https://cloud.google.com/bigquery/docs/partitioned-tables – jkff Nov 02 '18 at 21:41
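The partition decorator mentioned in the comments can be sketched as follows: append `$yyyyMMdd` to the table spec so that the source reads only that partition, with no query needed. The table spec and the `decorate` helper are placeholders for illustration:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class PartitionDecorator {
    // Append a partition decorator ($yyyyMMdd) to a BigQuery table spec
    // so a table read touches only that single partition.
    static String decorate(String tableSpec, LocalDate day) {
        return tableSpec + "$" + day.format(DateTimeFormatter.BASIC_ISO_DATE); // yyyyMMdd
    }

    public static void main(String[] args) {
        String spec = decorate("myproject:mydataset.mytable", LocalDate.of(2018, 3, 20));
        System.out.println(spec); // myproject:mydataset.mytable$20180320
        // Passed to the source instead of a query, e.g.:
        //   p.apply(BigQueryIO.readTableRows().from(spec));
    }
}
```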

1 Answer


Well, to be honest, there is a better way to do it. There are two ways to do it for performance, and one if you look into the BigQuery API.

So recall that

    SELECT * FROM `tableName`

is an expensive operation for your database. In order to speed up the read time you could instead:

    SELECT <field1>, <field2>, ... FROM `tableName`

Of course, listing all the field names is not practical in many cases, but you will get roughly twice the speed-up.
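A small sketch of the column-projection idea: build the `SELECT` from an explicit field list instead of `*`, which reduces the bytes BigQuery scans and returns. The field and table names below are placeholders:

```java
import java.util.List;

public class ProjectedQuery {
    // Build a SELECT that names only the needed columns rather than
    // using *, so BigQuery reads and ships less data per row.
    static String select(List<String> fields, String table) {
        return "SELECT " + String.join(", ", fields) + " FROM `" + table + "`";
    }

    public static void main(String[] args) {
        String query = select(List.of("user_id", "event_ts"), "mydataset.mytable");
        System.out.println(query); // SELECT user_id, event_ts FROM `mydataset.mytable`
    }
}
```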

But instead of doing that, you can follow the example here. This will also give you roughly twice the speed-up and is much more convenient.

Let me know if you need a coded example of the link provided. Cheers!

Haris Nadeem
  • Thanks for the answer. Could you please elaborate more why using BigQuery API is faster than using BigQueryIO.Read? – bleare Mar 27 '18 at 20:54
  • Sorry, my mistake. I misread the question and mistook BigQueryIO for BigQuery. That being said, your BigQueryIO uses the BigQuery API underneath, from what I can see [here](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java). So BigQueryIO would be the same speed as BigQuery or faster, depending on the number of workers assigned to reading the table. Thus, it is faster with BigQueryIO, but the data read is passed as a PCollection. However, my suggestion of naming columns is better. – Haris Nadeem Mar 28 '18 at 00:25
  • https://stackoverflow.com/questions/65512/which-is-faster-best-select-or-select-column1-colum2-column3-etc – Haris Nadeem Mar 28 '18 at 00:29