
Background: We are using the Cloud Dataflow runner in Beam 2.0 to ETL our data into our warehouse in BigQuery. We would like to use the BigQuery Client Libraries (Beta) to create the schema of our data warehouse before the Beam pipelines populate it with data. (Reasons: full control over table definitions, e.g. partitioning; ease of creating DW instances, i.e. datasets; separation of ETL logic from DW design; and code modularisation.)

Problem: The BigQuery IO in Beam uses the TableFieldSchema and TableSchema classes under com.google.api.services.bigquery.model to represent BigQuery fields and schemas, while the BigQuery Client Libraries use TableDefinition under the com.google.cloud.bigquery package for the same purpose, so the field and schema definitions cannot be defined in one place and re-used in the other.

Is there a way to define the schema at one place and re-use it?
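One workaround (not from the original post, just a sketch of the idea) is to keep a single plain-data description of the fields and derive both representations from it. The class below assumes both the google-cloud-bigquery client library and the Beam-bundled com.google.api.services.bigquery.model classes are on the classpath; factory-method names such as Field.newBuilder and LegacySQLTypeName.valueOf vary between client-library versions, so check them against the version you depend on.

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.LegacySQLTypeName;
import com.google.cloud.bigquery.Schema;
import java.util.ArrayList;
import java.util.List;

public class SchemaDefinitions {

  // Single source of truth: field name, BigQuery type, mode.
  // Field names and types here are illustrative, not from the original post.
  private static final String[][] FIELDS = {
      {"user_id",    "STRING",    "REQUIRED"},
      {"event_time", "TIMESTAMP", "REQUIRED"},
      {"payload",    "STRING",    "NULLABLE"},
  };

  /** Beam-side schema, usable with BigQueryIO's withSchema(...). */
  public static TableSchema asBeamTableSchema() {
    List<TableFieldSchema> fields = new ArrayList<>();
    for (String[] f : FIELDS) {
      fields.add(new TableFieldSchema().setName(f[0]).setType(f[1]).setMode(f[2]));
    }
    return new TableSchema().setFields(fields);
  }

  /** Client-library schema, usable when creating the table up front. */
  public static Schema asClientSchema() {
    List<Field> fields = new ArrayList<>();
    for (String[] f : FIELDS) {
      fields.add(Field.newBuilder(f[0], LegacySQLTypeName.valueOf(f[1]))
          .setMode(Field.Mode.valueOf(f[2]))
          .build());
    }
    return Schema.of(fields.toArray(new Field[0]));
  }
}
```

The DW-creation code would call asClientSchema() when building its StandardTableDefinition, while the pipeline passes asBeamTableSchema() to BigQueryIO, so the field list lives in exactly one place.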

Thanks, Soby

p.s. we are using the Java SDK in Beam

1 Answer

A similar question was asked here.

I wrote some utils and published them on GitHub that might be of interest to you.

The ParseToProtoBuffer.py script downloads the schema from BigQuery and parses it into a Protobuf schema (you might want to look into Protobufs to boost your pipeline's performance as well). If you compile this into a Java class and use it in your project, you can use the makeTableSchema function in ProtobufUtils.java to get the TableSchema for that class. You might want to use makeTableRow as well if you decide to develop your pipeline with Protobufs.
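As a rough illustration of where that utility would fit (the class name MyEvent and the exact makeTableSchema signature are assumptions here; check ProtobufUtils.java in the linked repo for the real one):

```java
// MyEvent is the Java class compiled from the .proto that
// ParseToProtoBuffer.py generated from the BigQuery schema.
// Hypothetical call; verify the actual signature in ProtobufUtils.java.
TableSchema schema = ProtobufUtils.makeTableSchema(MyEvent.class);
```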

The code I pushed there is WIP and not being used in production or anything yet, but I hope it gives you a push in the right direction.

Matthias Baetens