
Hive partitioned tables have a folder structure where the partition date is encoded in the folder name. I have explored loading externally partitioned tables directly into BigQuery, which is possible.

What I would like to know is whether this is possible with Dataflow, since I will be running some feature transforms in Dataflow before loading the data into BigQuery. What I have found is that if I add the partition date as a column during the transforms, then partitioning is possible (a rough sketch of that workaround is below). However, I am looking for a direct method where the column is not added during the transforms but is instead populated while loading the data into BigQuery.
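For reference, the workaround I found looks roughly like this. It is only a minimal sketch: the bucket, project, dataset, table, schema and the `dt` column name are placeholders. It parses the date out of the Hive-style folder name and writes it as an ordinary column:

```python
import json

import apache_beam as beam
from apache_beam.io import fileio


def add_partition_column(readable_file):
    # Parse the partition date out of the Hive-style folder name,
    # e.g. gs://my-bucket/events/dt=2019-12-01/part-000.json
    path = readable_file.metadata.path
    partition_date = path.split("dt=")[1].split("/")[0]
    for line in readable_file.read_utf8().splitlines():
        row = json.loads(line)
        row["dt"] = partition_date  # the column added during the transform
        yield row


with beam.Pipeline() as p:
    (p
     | fileio.MatchFiles("gs://my-bucket/events/dt=*/*.json")
     | fileio.ReadMatches()
     | beam.FlatMap(add_partition_column)
     | beam.io.WriteToBigQuery(
         "my_project:my_dataset.events",
         schema="dt:DATE,user_id:STRING,value:FLOAT",
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```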

Is such a thing possible?

Kevin Hernandez
  • Do you want to load Hive partitioned tables from Dataflow, or another kind of [partition](https://cloud.google.com/bigquery/docs/partitioned-tables#top_of_page)? Keep in mind that to use Hive partitioning the data still needs to be loaded from Google Cloud Storage, even if you process it with Dataflow – Tlaquetzal Dec 06 '19 at 22:52
  • A Hive partitioned table, using Dataflow from GCS. Since it is a Hive partitioned table, the partition column itself will not exist in the original data, but it should be present as a column after the data is loaded into BigQuery. – Abhinav Rai Dec 08 '19 at 04:50

1 Answer


Hive partitioning is a beta feature in BigQuery, released on Oct 31st, 2019. The latest version of the Apache Beam SDK supported by Dataflow is 2.16.0, released on Oct 7th, 2019. At the moment, neither the Java nor the Python SDK supports this feature directly. So, if you want to use it from Dataflow, you could try calling the BigQuery API directly once your transformed output has been written back to Cloud Storage in a Hive-partitioned layout.
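As a minimal sketch of that idea, assuming the transformed output is already in Cloud Storage under Hive-style folders and your google-cloud-bigquery client version already exposes HivePartitioningOptions (the bucket, project, dataset and table names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Let BigQuery infer the partition keys from the folder names,
# e.g. gs://my-bucket/output/dt=2019-12-01/part-000.parquet
hive_options = bigquery.HivePartitioningOptions()
hive_options.mode = "AUTO"
hive_options.source_uri_prefix = "gs://my-bucket/output/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    hive_partitioning=hive_options,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/output/*",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
```

With this kind of load job, the partition key from the folder name (dt in this sketch) should appear as a regular column in the loaded table without being added during the Dataflow transforms.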

Tlaquetzal