
My scenario is a variation on the one discussed here: How do I write to BigQuery using a schema computed during Dataflow execution?

In this case, the goal is the same (read a schema during execution, then write a table with that schema to BigQuery), but I want to accomplish it within a single pipeline.

For example, I'd like to write a CSV file to BigQuery and avoid fetching the file twice (once to read schema, once to read data).

Is this possible? If so, what's the best approach?


My current best guess is to read the schema into a PCollection via a side output and then use that to create the table (with a custom PTransform) before passing the data to BigQueryIO.Write.
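Roughly, I'm imagining something like this for the table-creation step (untested sketch; the project/dataset/table names and the client-building helper are just placeholders):

```java
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Table;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.transforms.DoFn;

// Untested sketch: a DoFn applied to the schema PCollection that creates the
// destination table via the BigQuery API client before the data is written.
class CreateTableFn extends DoFn<TableSchema, Void> {
  @Override
  public void processElement(ProcessContext c) throws Exception {
    Bigquery bigquery = newBigqueryClient(); // placeholder: build an authorized BigQuery client
    Table table = new Table()
        .setTableReference(new TableReference()
            .setProjectId("my-project")      // placeholder names
            .setDatasetId("my_dataset")
            .setTableId("my_table"))
        .setSchema(c.element());             // schema read during execution
    bigquery.tables().insert("my-project", "my_dataset", table).execute();
  }
}
```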

dmb
  • Possible duplicate of [Writing different values to different BigQuery tables in Apache Beam](http://stackoverflow.com/questions/43505534/writing-different-values-to-different-bigquery-tables-in-apache-beam) – jkff May 04 '17 at 21:17

1 Answer


If you use BigQueryIO.Write to create the table, then the schema needs to be known when the table is created.

Your proposed solution of not specifying the schema when you create the BigQueryIO.Write transform might work, but you may get an error because the table doesn't exist and you aren't configuring BigQueryIO.Write to create it if needed.

You might want to consider reading just enough of your CSV files in your main program to determine the schema before running your pipeline. This would avoid the complexity of determining the schema at runtime. You would still incur the cost of the extra read but hopefully that's minimal.
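For example, something along these lines in your main program would do it, assuming the first line of the CSV is a header row and that loading every column as a STRING is acceptable (the path handling is simplified; for a file on GCS you would read the first line via the storage client instead):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableSchema;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

public class CsvSchemaReader {
  // Read only the first (header) line of the CSV and build a schema from it,
  // treating every column as a STRING.
  public static TableSchema schemaFromCsvHeader(String path) throws Exception {
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
      String header = reader.readLine();
      List<TableFieldSchema> fields = new ArrayList<>();
      for (String column : header.split(",")) {
        fields.add(new TableFieldSchema().setName(column.trim()).setType("STRING"));
      }
      return new TableSchema().setFields(fields);
    }
  }
}
```

You could then pass the resulting schema to BigQueryIO.Write via withSchema along with CREATE_IF_NEEDED, so only the header line is read twice rather than the whole file.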

Alternatively, you could create a custom sink to write your data to BigQuery. Your sink would write the data to GCS, and its finalize method could then create a BigQuery load job. The sink could infer the schema by looking at the records and create the BigQuery table with the appropriate schema, as in the sketch below.
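For instance, the finalize step might kick off a load job roughly like this, assuming the bundles have already been written to GCS as newline-delimited JSON and that inferredSchema was built by inspecting the records (the project, dataset, table, and bucket names are placeholders, and all of the Sink/Writer plumbing is omitted):

```java
import com.google.api.services.bigquery.Bigquery;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobConfiguration;
import com.google.api.services.bigquery.model.JobConfigurationLoad;
import com.google.api.services.bigquery.model.TableReference;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;

// Sketch of the finalize step: load the files the sink wrote to GCS into a
// table whose schema was inferred from the records.
class LoadJobSketch {
  void loadIntoBigQuery(Bigquery bigquery, TableSchema inferredSchema) throws Exception {
    JobConfigurationLoad load = new JobConfigurationLoad()
        .setSourceUris(Collections.singletonList("gs://my-bucket/my-sink-output-*.json"))
        .setSourceFormat("NEWLINE_DELIMITED_JSON")
        .setDestinationTable(new TableReference()
            .setProjectId("my-project")     // placeholder names
            .setDatasetId("my_dataset")
            .setTableId("my_table"))
        .setSchema(inferredSchema)          // schema inferred from the records
        .setCreateDisposition("CREATE_IF_NEEDED")
        .setWriteDisposition("WRITE_TRUNCATE");
    Job job = new Job().setConfiguration(new JobConfiguration().setLoad(load));
    bigquery.jobs().insert("my-project", job).execute();
  }
}
```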

Jeremy Lewi
  • Do you think it'd be advisable to use BigQueryTableInserter directly? – dmb Jun 11 '15 at 16:03
  • I'd probably recommend creating a custom sink rather than using a BigQuery TableInserter. I updated my answer to describe this. – Jeremy Lewi Jun 12 '15 at 14:51