
I'm trying to write a Dataflow job that needs to process logs located on storage and write them to different BigQuery tables. Which output tables are used depends on the records in the logs. So I do some processing on the logs and yield them with a key based on a value in the log, then I group the logs by key. All the logs grouped under the same key need to be written to the same table.

I'm trying to use the `beam.io.gcp.bigquery.WriteToBigQuery` transform with a callable as the `table` argument, as described in the documentation here.

I would like to use date-partitioned tables, as this easily allows me to `write_truncate` the different partitions.
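Roughly, the setup looks like this (a minimal sketch; the bucket, project/dataset IDs, the `log_type`/`date` fields and `parse_log` are placeholders, the keying/grouping step is omitted, and the tables are assumed to be created in a previous step):

```python
import json

import apache_beam as beam


def parse_log(line):
    # Assume each log line is a JSON record (placeholder parsing logic).
    return json.loads(line)


def partitioned_table(element):
    # Route each record to a table based on a value in the log, with a
    # "$YYYYMMDD" partition decorator so WRITE_TRUNCATE only replaces that
    # partition, e.g. 'my-project:my_dataset.logs_auth$20190322'.
    return 'my-project:my_dataset.logs_{}${}'.format(
        element['log_type'], element['date'].replace('-', ''))


with beam.Pipeline() as p:
    (p
     | 'ReadLogs' >> beam.io.ReadFromText('gs://my-bucket/logs/*')
     | 'Parse' >> beam.Map(parse_log)
     | 'Write' >> beam.io.gcp.bigquery.WriteToBigQuery(
         table=partitioned_table,
         schema='date:DATE,log_type:STRING,message:STRING',
         create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
         write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```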

Now I encounter 2 main problems:

  • `CREATE_IF_NEEDED` gives an error because it would have to create a partitioned table. I can circumvent this by making sure the tables exist in a previous step, creating them if they don't.
  • If I load older data I get the following error:
The destination table's partition table_name_x$20190322 is outside the allowed bounds. You can only stream to partitions within 31 days in the past and 16 days in the future relative to the current date.

This seems like a limitation of streaming inserts. Is there any way to do batch inserts?

Maybe I'm approaching this wrong and should use another method. Any guidance on how to tackle these issues is appreciated.

I'm using Python 3.5 and apache-beam==2.13.0.

Georges Lorré

1 Answer


That error message can be logged when one mixes the use of an ingestion-time partitioned table and a column-partitioned table (see this similar issue). Summarizing from the link: it is not possible to use column-based partitioning (as opposed to ingestion-time partitioning) and write to tables with partition suffixes.

In your case, since you want to write to different tables based on a value in the log and have partitions within each table, forgo the use of the partition decorator when selecting which table (use "[prefix]_YYYYMMDD") and then have each individual table be column-based partitioned.
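A rough sketch of that approach (the project/dataset IDs and field names are illustrative, and each destination table is assumed to be pre-created as a column-partitioned table, e.g. partitioned on a `timestamp` column):

```python
import apache_beam as beam


def daily_table(element):
    # Return a table name with a "_YYYYMMDD" suffix instead of a "$YYYYMMDD"
    # partition decorator, e.g. 'my-project:my_dataset.logs_auth_20190322'.
    # Without the decorator, the 31-days-back/16-days-forward streaming bound
    # on partition decorators is not hit; rows are placed in partitions based
    # on the table's partitioning column instead.
    return 'my-project:my_dataset.logs_{}_{}'.format(
        element['log_type'], element['timestamp'][:10].replace('-', ''))


write_logs = beam.io.gcp.bigquery.WriteToBigQuery(
    table=daily_table,
    schema='timestamp:TIMESTAMP,log_type:STRING,message:STRING',
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
```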

Cubez
  • 878
  • 5
  • 11
  • I'm not sure how I'm mixing the 2 kinds of partitioning. My tables are ingestion-time partitioned (pseudo column _PARTITIONTIME) and I'm using the decorator to load data into the specified partition. This works just fine if I stay within the `31 days in the past and 16 days in the future` limit of streaming inserts. But it was my understanding that the `WriteToBigQuery` module would use the batch insert (FILE_LOADS) method and thus avoid the streaming timeframe limitation. I have also considered using column-based partitioning, but then it is harder to avoid duplicates when rerunning jobs. – Georges Lorré Jul 10 '19 at 08:20
  • Are you using the "--experiments use_beam_bq_sink" flag? To enable the feature that switches between streaming inserts and batch inserts, the SDK requires this experiment to be passed in. Edit: I just found a similar question at https://stackoverflow.com/questions/43505534/writing-different-values-to-different-bigquery-tables-in-apache-beam. But for your use case, enabling the flag might fix it. – Cubez Jul 10 '19 at 18:07
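A sketch of passing that experiment flag via pipeline options (the remaining options are placeholders for the usual Dataflow settings):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Enable the experiment mentioned in the comment above so the new BigQuery
# sink (which can use file loads instead of streaming inserts) is available.
options = PipelineOptions([
    '--experiments=use_beam_bq_sink',
    '--runner=DataflowRunner',
    '--project=my-project',
    '--temp_location=gs://my-bucket/tmp',
])

# pipeline = beam.Pipeline(options=options)
# With the experiment enabled, WriteToBigQuery goes through the new sink; in
# SDK versions that expose it, method=beam.io.WriteToBigQuery.Method.FILE_LOADS
# can also be passed to WriteToBigQuery to request batch loads explicitly.
```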