
My table name format is tableName_YYYYMMDD. I am trying to write to this table from a streaming Dataflow pipeline. The reason I want to write to a new table every day is that I want to expire tables after 30 days and only keep a rolling window of 30 tables at a time.

Current code:

tableRow.apply(BigQueryIO.Write
                .named("WriteBQTable")
                .to(String.format("%1$s:%2$s.%3$s",projectId, bqDataSet, bqTable))
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

I do realize the code above will not roll over to a new table when the day changes.

As this answer suggests, I could partition the table and expire partitions instead, but writing to a partitioned table does not seem to be supported from a streaming pipeline.

Any ideas how I can work around this?


2 Answers


In the Dataflow 2.0 SDK there is a way to specify dynamic destinations.

See to(DynamicDestinations<T,?> dynamicDestinations) in BigQuery DynamicDestinations.

Also see the TableDestination version, which should be simpler and require less code, though unfortunately there is no example in the javadoc:

to(SerializableFunction<ValueInSingleWindow<T>,TableDestination> tableFunction)

https://beam.apache.org/documentation/sdks/javadoc/2.0.0/
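For concreteness, here is a minimal sketch of the TableDestination-function approach against the Beam 2.0 Java SDK. It routes each element to a day-sharded table named tableName_YYYYMMDD derived from the element's timestamp; projectId, bqDataSet, bqTable and schema are the placeholders from the question, not API names.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.ValueInSingleWindow;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;

    tableRow.apply("WriteBQTable",
        BigQueryIO.writeTableRows()
            .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
              @Override
              public TableDestination apply(ValueInSingleWindow<TableRow> value) {
                // Derive the yyyyMMdd suffix from the element's timestamp (UTC day boundaries).
                DateTimeFormatter fmt = DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC();
                String day = fmt.print(value.getTimestamp());
                String tableSpec = String.format("%s:%s.%s_%s", projectId, bqDataSet, bqTable, day);
                return new TableDestination(tableSpec, "Daily shard for " + day);
              }
            })
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

With CREATE_IF_NEEDED the first element of each day creates that day's table; expiring the old shards to keep the 30-table window still has to be handled outside the pipeline, e.g. via a default table expiration on the dataset.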

  • I have been testing this by writing to a partitioned table from Dataflow, and it looks like BigQuery is assigning the values correctly even though I am not setting `_PARTITIONTIME` from Dataflow. The following returns the correct data: `SELECT * FROM Mytable WHERE _PARTITIONTIME = TIMESTAMP("2017-08-04")` – PUG Aug 05 '17 at 16:50

This is an open source pipeline you can use to connect Pub/Sub to BigQuery. I think Google has also added support for streaming pipelines writing to date-partitioned tables. Details here.
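If you go the date-partitioned-table route, here is a hedged sketch of what the write could look like on a newer Beam SDK (withTimePartitioning is not in 2.0; it was added in later releases). The whole stream goes into one ingestion-time partitioned table, and a 30-day partition expiration replaces the 30-table window; the variable names are again the question's placeholders.

    import com.google.api.services.bigquery.model.TimePartitioning;

    tableRow.apply("WriteBQTable",
        BigQueryIO.writeTableRows()
            .to(String.format("%s:%s.%s", projectId, bqDataSet, bqTable))
            .withSchema(schema)
            // Ingestion-time DAY partitioning; partitions older than 30 days are dropped automatically.
            .withTimePartitioning(new TimePartitioning()
                .setType("DAY")
                .setExpirationMs(30L * 24 * 60 * 60 * 1000))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));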
