
My table name format is tableName_YYYYMMDD. I am trying to write to this table from a streaming Dataflow pipeline. The reason I want to write to a new table every day is that I want to expire tables after 30 days and only keep a rolling window of 30 tables at a time.

Current code:

tableRow.apply(BigQueryIO.Write
                .named("WriteBQTable")
                .to(String.format("%1$s:%2$s.%3$s",projectId, bqDataSet, bqTable))
                .withSchema(schema)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

I do realize the code above will not roll over to a new table when the day changes.

As this answer suggests, I could partition the table and expire partitions instead, but writing to a partitioned table does not seem to be supported from a streaming pipeline.

Any ideas how I can work around this?


2 Answers


In the Dataflow 2.0 SDK there is a way to specify dynamic destinations.

See to(DynamicDestinations<T,?> dynamicDestinations) in BigQuery DynamicDestinations.

Also see the TableDestination version, which should be simpler and require less code, though unfortunately there is no example in the javadoc:

to(SerializableFunction<ValueInSingleWindow<T>,TableDestination> tableFunction)

https://beam.apache.org/documentation/sdks/javadoc/2.0.0/
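For concreteness, here is a minimal sketch of the TableDestination-function approach against the Beam 2.0 Java SDK. It routes each element to a day-sharded table named tableName_YYYYMMDD derived from the element's timestamp; projectId, bqDataSet, bqTable and schema are the placeholders from the question, not API names.

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
    import org.apache.beam.sdk.transforms.SerializableFunction;
    import org.apache.beam.sdk.values.ValueInSingleWindow;
    import org.joda.time.format.DateTimeFormat;
    import org.joda.time.format.DateTimeFormatter;

    tableRow.apply("WriteBQTable",
        BigQueryIO.writeTableRows()
            .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
              @Override
              public TableDestination apply(ValueInSingleWindow<TableRow> value) {
                // Derive the yyyyMMdd suffix from the element's timestamp (UTC day boundaries).
                DateTimeFormatter fmt = DateTimeFormat.forPattern("yyyyMMdd").withZoneUTC();
                String day = fmt.print(value.getTimestamp());
                String tableSpec = String.format("%s:%s.%s_%s", projectId, bqDataSet, bqTable, day);
                return new TableDestination(tableSpec, "Daily shard for " + day);
              }
            })
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

With CREATE_IF_NEEDED the first element of each day creates that day's table; expiring the old shards to keep the 30-table window still has to be handled outside the pipeline, e.g. via a default table expiration on the dataset.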

  • I have been testing this by writing to a partitioned table from Dataflow, and it looks like BigQuery is assigning the values correctly even though I am not setting `_PARTITIONTIME` from Dataflow. The following returns the correct data: `SELECT * FROM Mytable WHERE _PARTITIONTIME = TIMESTAMP("2017-08-04")` – PUG Aug 05 '17 at 16:50

This is an open source pipeline you can use to connect Pub/Sub to BigQuery. I think Google has also added support for streaming pipelines writing to date-partitioned tables. Details here.
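If you go the date-partitioned-table route, here is a hedged sketch of what the write could look like on a newer Beam SDK (withTimePartitioning is not in 2.0; it was added in later releases). The whole stream goes into one ingestion-time partitioned table, and a 30-day partition expiration replaces the 30-table window; the variable names are again the question's placeholders.

    import com.google.api.services.bigquery.model.TimePartitioning;

    tableRow.apply("WriteBQTable",
        BigQueryIO.writeTableRows()
            .to(String.format("%s:%s.%s", projectId, bqDataSet, bqTable))
            .withSchema(schema)
            // Ingestion-time DAY partitioning; partitions older than 30 days are dropped automatically.
            .withTimePartitioning(new TimePartitioning()
                .setType("DAY")
                .setExpirationMs(30L * 24 * 60 * 60 * 1000))
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));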
