I have a streaming pipeline where I need to query BigQuery as reference data for a pipeline transform. Since the BigQuery tables only change every two weeks, I put the query cache in setup() instead of start_bundle(). From observing the logs, I saw that start_bundle() refreshes its value roughly every 50-100 processed elements in the DoFn lifecycle, but setup() is never refreshed. Is there any way to deal with this problem?
2 Answers
While you did not provide your code, I will answer your question based on your explanation.
First, regarding DoFn.start_bundle(): this function is called once for every bundle, and it is up to Dataflow to decide the size of each bundle based on metrics gathered during execution.
Second, DoFn.setup() is called once per worker and will only be called again if the worker is restarted. As a comparison, DoFn.process() is called once per element.
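As a quick illustration of that lifecycle (a sketch, not code from your pipeline), a DoFn like the one below would log setup once per worker instance, start_bundle/finish_bundle once per bundle, and process once per element:

```python
import logging

import apache_beam as beam


class LifecycleLoggingFn(beam.DoFn):
    """Logs each DoFn lifecycle hook so you can see how often it fires."""

    def setup(self):
        logging.info('setup: once per DoFn instance on a worker')

    def start_bundle(self):
        logging.info('start_bundle: once per bundle')

    def process(self, element):
        logging.info('process: once per element')
        yield element

    def finish_bundle(self):
        logging.info('finish_bundle: once per bundle')
```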
Since you only need to refresh your query every two weeks, this is a perfect use case for a side input implementing the "Slowly-changing lookup cache" pattern. You can use this approach when you have a lookup table that changes from time to time and you need to update the result of the lookup. Instead of running a single query in batch mode, you run in streaming mode and periodically re-emit the lookup result (in your case, the query result) into a global window. You can then consume this side input from your main streaming PCollection; see the sketch after the note below.
Note: I must point out that, as a limitation, side inputs do not work well with very large amounts of data (many GBs or TBs). Furthermore, this explanation is very informative.
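Here is a minimal sketch of that pattern in the Python SDK. The table `my_project.my_dataset.lookup`, the Pub/Sub topic, and the element shape are hypothetical placeholders; the key idea is that PeriodicImpulse re-runs the query on a fixed interval and re-emits the result so the side input is refreshed without restarting the pipeline:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.periodicsequence import PeriodicImpulse
from apache_beam.utils.timestamp import MAX_TIMESTAMP, Timestamp

TWO_WEEKS_SECONDS = 14 * 24 * 60 * 60


def read_lookup_table(_):
    """Runs the reference query and yields its rows as one dict (hypothetical table)."""
    from google.cloud import bigquery  # imported here so it is available on the workers
    client = bigquery.Client()
    rows = client.query(
        'SELECT key, value FROM `my_project.my_dataset.lookup`').result()
    yield {row['key']: row['value'] for row in rows}


def enrich(element, lookup):
    """Joins each streaming element against the latest lookup dictionary."""
    return {**element, 'ref': lookup.get(element['key'])}


with beam.Pipeline() as p:
    # Fires every two weeks; apply_windowing=True puts each impulse in its own
    # window so the side input is replaced instead of accumulated.
    lookup = (
        p
        | 'Impulse' >> PeriodicImpulse(
            start_timestamp=Timestamp.now(),
            stop_timestamp=MAX_TIMESTAMP,
            fire_interval=TWO_WEEKS_SECONDS,
            apply_windowing=True)
        | 'QueryBigQuery' >> beam.FlatMap(read_lookup_table))

    main = (
        p
        # Hypothetical streaming source; any unbounded source works here.
        | 'ReadStream' >> beam.io.ReadFromPubSub(
            topic='projects/my_project/topics/events')
        | 'Parse' >> beam.Map(lambda b: {'key': b.decode('utf-8')})
        | 'Window' >> beam.WindowInto(window.FixedWindows(60)))

    enriched = main | 'Enrich' >> beam.Map(
        enrich, lookup=beam.pvalue.AsSingleton(lookup))
```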

The above answer is good. As an alternative, you could invoke a method in start_bundle() that returns a cached version of the results as long as it's fresh enough, and does a full read from BigQuery otherwise. See, e.g., Python in-memory cache with time to live.
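A minimal sketch of that alternative, assuming a hypothetical reference table and a two-week TTL; the worker-level cache lives in instance attributes set up in setup(), and start_bundle() only re-queries BigQuery when the cache is missing or stale:

```python
import time
from typing import Optional

import apache_beam as beam

CACHE_TTL_SECONDS = 14 * 24 * 60 * 60  # two weeks


class EnrichFn(beam.DoFn):
    """Caches the BigQuery lookup and refreshes it in start_bundle when stale."""

    def setup(self):
        # Worker-level state: survives across bundles on the same worker.
        self._lookup = None
        self._loaded_at: Optional[float] = None

    def start_bundle(self):
        # Refresh only if the cache is missing or older than the TTL,
        # so most bundles reuse the in-memory copy.
        if self._lookup is None or time.time() - self._loaded_at > CACHE_TTL_SECONDS:
            self._lookup = self._read_from_bigquery()
            self._loaded_at = time.time()

    def process(self, element):
        yield {**element, 'ref': self._lookup.get(element['key'])}

    def _read_from_bigquery(self):
        # Hypothetical query; replace with your real reference table.
        from google.cloud import bigquery
        client = bigquery.Client()
        rows = client.query(
            'SELECT key, value FROM `my_project.my_dataset.lookup`').result()
        return {row['key']: row['value'] for row in rows}
```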
