
Is there any possibility to create a BigQueryIO.Write-like sink with a function that returns the table schema (or, preferably, reads it from a side input)?

In my use case I'm fetching messages from PubSub which are transformed to TableRow. Messages may have different schemas, which are always backward compatible.

In my pipeline I'm able to determine the newest schema for a particular window, and that schema needs to be used by the sink writing to BigQuery.
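
For context, this is roughly the static pattern I'm limited to today with the Dataflow SDK; the table spec and field names below are just placeholders:

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;

// ...

// The schema is fixed at pipeline-construction time, which is the limitation here.
TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("user_id").setType("STRING"),
    new TableFieldSchema().setName("payload").setType("STRING")));

// tableRows is the PCollection<TableRow> produced from the PubSub messages.
tableRows.apply(BigQueryIO.Write
    .to("my-project:my_dataset.my_table")
    .withSchema(schema)  // cannot change per window
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```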

Is there any chance that such a feature is available in the SDK? ;)

Marcin Pietraszek
  • Marcin, how often do you expect the schema to be changing? – Sam McVeety Oct 19 '16 at 01:33
  • I don't know. That is meant to be automatic, as we're exposing a UI for changing the schema to our end users. For that reason we'd like to eliminate all the manual work required here. – Marcin Pietraszek Oct 19 '16 at 05:09
  • (To answer the original question explicitly, this is not available in the SDK.) Is it possible to trigger a schema update via whatever code path the UI is using? That seems less potentially racy than just-in-time updating on message processing. – Sam McVeety Oct 19 '16 at 13:53
  • How could that solution help, as BigQueryIO.Write requires the schema that's going to be used when writing data to BigQuery? On the other hand, we could stop the current Dataflow pipeline and start a new one, passing the newest schema to it (this is the way I'm doing it right now). We could also create a "demultiplexer" that separates messages from one PubSub topic into N topics grouped by schema and run N pipelines, but that would double the cost. The whole point of the "dynamic schema" in one pipeline is to eliminate the hassle of restarting the pipeline. – Marcin Pietraszek Oct 20 '16 at 06:12
  • I may be mistaken, but my understanding is that the data being written to BigQuery is not validated against the schema when being streamed in. This is a streaming pipeline, right? – Sam McVeety Oct 22 '16 at 18:17
  • Data is validated against the schema prior to being written to the PubSub topic from which we're reading it. Yes, it's a streaming pipeline. We're using an additional component in front of PubSub that does that validation. – Marcin Pietraszek Oct 24 '16 at 06:23
  • If it is being validated against the schema, it seems like that schema either *(a) needs to be updated with the pipeline or (b) will reflect an updated (and backwards-compatible) change in the table schema. Is neither of these the case? – Sam McVeety Oct 25 '16 at 15:18
  • message1 (schema1) -> message2 (schema2) -> message3 (schema3); ourcomponent -> pubsub -> dataflow -> bigquery. ourcomponent validates the schema and forces it to be backward compatible. With one Dataflow pipeline we'd like to handle the stream of messages presented above. And yes, the BigQuery schema needs to be updated; I've planned to do this as part of our pipeline, where for the MAX schema number I'd use the BigQuery service to update it. The problem is that BigQueryIO.Write requires a static schema that is known during pipeline definition. Couldn't it be defined per window, as it is with the table? – Marcin Pietraszek Oct 26 '16 at 06:13
  • Yes, schema could be per-table. Could you file a JIRA issue for that? – Sam McVeety Oct 28 '16 at 00:57
  • Possible duplicate of [Writing different values to different BigQuery tables in Apache Beam](http://stackoverflow.com/questions/43505534/writing-different-values-to-different-bigquery-tables-in-apache-beam) – jkff May 04 '17 at 21:17
  • Nope, that's a completely different thing. The other question is about X output tables; this question is about one table whose schema evolves over the pipeline's lifetime. – Marcin Pietraszek May 06 '17 at 20:22

1 Answer


https://issues.apache.org/jira/browse/BEAM-2023 tracks the ability to pull schemas from side inputs so that they can be determined dynamically. This functionality is now available in the latest version of Apache Beam.
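
A minimal sketch of what this can look like, assuming the side-input support is exposed via BigQueryIO.Write#withSchemaFromView; the table spec and the way the per-table schema map is built are placeholders, not part of the tracked issue:

```java
import java.util.Map;

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.View;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionView;

// ...

// rows: the TableRows parsed from PubSub (reading/parsing omitted).
// schemas: one KV per table spec, the value being the JSON-serialized TableSchema
// chosen for the current window (e.g. the newest backward-compatible one seen).
PCollection<TableRow> rows = ...;
PCollection<KV<String, String>> schemas = ...;

// Turn the per-window schema map into a side input.
PCollectionView<Map<String, String>> schemaView =
    schemas.apply(View.<String, String>asMap());

rows.apply(
    BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")   // placeholder table spec
        .withSchemaFromView(schemaView)         // schema is looked up in the side input
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
```

One thing to keep in mind: a side input's windowing has to be mappable onto the main input's windows, so the schema map should be recomputed per window (or kept in the global window with a suitable trigger) so the runner can resolve a schema for each row being written.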

Sam McVeety