For context, I've never used Beam. I'm trying to understand how to apply the Beam model to common use cases.

Suppose you have an unbounded collection of Producers and an unbounded collection of Products, where each Product has exactly one Producer (one-to-many, Producer to Product). A Product's Producer normally appears before its Product, or at worst shortly after it. However, a Producer may appear years before its Product.

If you want to produce an unbounded collection of Products joined with their Producers, what's the appropriate way to express this? A windowed join that stretches over years seems to defeat the point of windowing. But using the Producers as a side input doesn't seem to handle the case where a Producer appears very close in time to its Product.

Is there an appropriate way to mix these two concepts?

Steven Noble

1 Answer

Since a Producer may appear years before its Product, you can keep the Producers in external storage (e.g. BigTable) and write a ParDo over the Product stream that looks up each Product's Producer and performs the join. To further optimize performance, you can use the stateful DoFn feature to batch the lookups (check out this blog).
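
As a rough illustration of the batched lookup, here is a minimal, dependency-free sketch of the logic such a ParDo might contain. It is not Beam code: `BatchedLookupJoiner`, `fetch_many`, and the `producer_id` field are hypothetical names, and a plain dict can stand in for the external store. In a real pipeline, the buffering would presumably live in a DoFn (for example flushing at the end of a bundle, or using state and timers), and `fetch_many` would wrap the storage client.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Dict[str, str]  # hypothetical shape, e.g. {"id": ..., "producer_id": ...}
Joined = Tuple[Record, Optional[Record]]


class BatchedLookupJoiner:
    """Buffers Products and resolves their Producers with one lookup per batch."""

    def __init__(
        self,
        fetch_many: Callable[[Iterable[str]], Dict[str, Record]],
        batch_size: int = 100,
    ) -> None:
        self._fetch_many = fetch_many  # hypothetical bulk read against the store
        self._batch_size = batch_size
        self._buffer: List[Record] = []

    def process(self, product: Record) -> List[Joined]:
        """Buffer one Product; return joined pairs only when a batch is full."""
        self._buffer.append(product)
        if len(self._buffer) >= self._batch_size:
            return self.flush()
        return []

    def flush(self) -> List[Joined]:
        """Look up all buffered Producer ids at once and emit the joined pairs."""
        batch, self._buffer = self._buffer, []
        if not batch:
            return []
        wanted = {p["producer_id"] for p in batch}
        producers = self._fetch_many(wanted)
        # A Product whose Producer is not in the store yet is paired with None.
        return [(p, producers.get(p["producer_id"])) for p in batch]
```

Products paired with `None` are the ones whose Producer has not reached the store yet; those are the candidates for a retry or for a short windowed join.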

You can still use windowing and CoGroupByKey to do the join in cases where Product data is delivered before Producer data. The window here can be small, just large enough to handle out-of-order delivery.
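
To make the small-window idea concrete, the following plain-Python simulation mimics what fixed windows plus a per-key co-group compute. It is only a sketch and does not use the Beam API; `cogroup_in_fixed_windows` and the `(event_time_s, producer_id, value)` record shape are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, str]  # (event_time_s, producer_id, value), hypothetical shape
Group = Tuple[List[str], List[str]]  # (producers, products) for one window and key


def cogroup_in_fixed_windows(
    producers: Iterable[Event],
    products: Iterable[Event],
    window_size_s: int,
) -> Dict[Tuple[int, str], Group]:
    """Group both streams by (fixed window start, producer_id)."""
    groups: Dict[Tuple[int, str], Group] = defaultdict(lambda: ([], []))
    for side, events in enumerate((producers, products)):
        for ts, key, value in events:
            # Fixed windows: each event belongs to the window containing its timestamp.
            window_start = ts - ts % window_size_s
            groups[(window_start, key)][side].append(value)
    return dict(groups)


if __name__ == "__main__":
    producers = [(65, "p1", "Acme")]
    products = [(62, "p1", "widget"), (130, "p1", "gadget")]
    for key, (found, items) in sorted(
        cogroup_in_fixed_windows(producers, products, 60).items()
    ):
        print(key, found, items)
```

In this example the Product at t=62 is joined with the Producer that arrives at t=65, because both fall in the window starting at 60. The Product at t=130 finds no Producer in its window, so it would fall back to the external-storage lookup. Note that with fixed windows, two events that are close in time can still land on opposite sides of a window boundary.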

Jiayuan Ma
  • Say you load those Producers from an unbounded source (e.g. Pub/Sub) and you are windowing the Products (e.g. hourly). How do you requery the Producers needed for that window efficiently? How do you requery them periodically as needed while the main pipeline is streaming? – Davos Jan 07 '19 at 09:47
  • Since BigTable is not yet supported for the Python SDK, is it feasible to use BigQuery for stateful processing? – Davos Jan 07 '19 at 13:16