
Per the question re: making a custom Sink to update schemas dynamically in Dataflow, I was wondering whether the BigQuery Patch operation is exposed anywhere in the BigQueryIO API?

This is a crucial piece of updating schemas on the fly: we are merging schemas as they come in, in backwards-compatible ways.

Matt
  • No, it's not. That's part of the BigQuery library. ParDo methods can call arbitrary code, so why don't you just import the BigQuery library and call the patch function that way? – Graham Polley Aug 12 '16 at 12:33
  • I tried something along those lines using a SideInput, but the BigQuery client is not serializable. I will try it within the context of a ParDo... – Matt Aug 12 '16 at 18:37
  • It seems that I cannot pass the client into the ParDo directly (it's not serializable), but I can call one of its nested classes (i.e. `List()`) and call execute() within the ParDo. Is this a safe approach? How are connections managed if I instantiate the connection at the top level but use it from inside the ParDo? – Matt Aug 12 '16 at 19:48
  • Make the BigQuery variable transient - like this: http://stackoverflow.com/questions/38709762/how-to-use-memcache-in-dataflow – Graham Polley Aug 12 '16 at 22:01

1 Answer


This is not in the SDK itself, but you can use the standard pattern of side-effecting from a DoFn. Specifically, you'll want to make sure the BigQuery client you create is marked transient, since DoFns must be serializable and the client is not:

class PatchFn extends DoFn<TableRow, TableRow> {  // input/output types shown here are just examples
  // transient: the client is not serialized with the DoFn; re-create it on the worker
  private transient BigQuery bq;
  ...
}
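As a fuller illustration of that pattern (not from the original answer), here is a sketch that builds the `com.google.api.services.bigquery.Bigquery` client lazily on the worker and calls `tables().patch(...)` from within the DoFn. The project/dataset/table names and the `mergeSchema()` helper are placeholders you would replace with your own schema-merging logic:

import com.google.api.client.googleapis.auth.oauth2.GoogleCredential;
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.jackson2.JacksonFactory;
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.BigqueryScopes;
import com.google.api.services.bigquery.model.Table;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

class PatchSchemaFn extends DoFn<TableRow, TableRow> {
  // Transient: the client isn't serializable, so it is rebuilt on each worker.
  private transient Bigquery bigquery;

  @Override
  public void startBundle(Context c) throws Exception {
    GoogleCredential credential = GoogleCredential.getApplicationDefault();
    if (credential.createScopedRequired()) {
      credential = credential.createScoped(BigqueryScopes.all());
    }
    bigquery = new Bigquery.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            JacksonFactory.getDefaultInstance(),
            credential)
        .setApplicationName("schema-patcher")
        .build();
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    TableRow row = c.element();
    // Placeholder: derive the widened, backwards-compatible schema from the row.
    TableSchema merged = mergeSchema(row);
    // Patch only touches the fields set on this Table resource -- here, the schema.
    Table patch = new Table().setSchema(merged);
    bigquery.tables()
        .patch("my-project", "my_dataset", "my_table", patch)
        .execute();
    c.output(row);
  }

  private TableSchema mergeSchema(TableRow row) {
    // Hypothetical helper; real logic would merge the existing and incoming fields.
    return new TableSchema();
  }
}

Unlike tables.update, tables.patch only modifies the fields you set on the Table resource, which is what makes it suitable for widening a schema in place.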
Sam McVeety