Suppose I have a Python Dataflow job in GCP that does the following two things:
1. Fetches some data from BigQuery
2. Calls an external API to get a certain value and filters the BigQuery data based on that value
I am able to do this; however, for the second step, the only way I could figure out how to implement it was as a class that extends DoFn, which is then called in parallel later:
import logging

import apache_beam as beam


class CallExternalServiceAndFilter(beam.DoFn):
    def to_runner_api_parameter(self, unused_context):
        pass

    def process(self, element, **kwargs):
        # Here I have to make the HTTP call and decide whether to yield the element or not;
        # however, this happens for each element of the set, as expected.
        if element['property'] < response_body_parsed['some_other_property']:
            logging.info("Yielding element")
            yield element
        else:
            logging.info("Not yielding element")
with beam.Pipeline(options=PipelineOptions(), argv=argv) as p:
    rows = p | 'Read data' >> beam.io.Read(beam.io.BigQuerySource(
        dataset='test',
        project=PROJECT,
        query='Select * from test.table'
    ))

    rows = rows | 'Calling external service and filtering items' >> beam.ParDo(
        CallExternalServiceAndFilter())

    # ...
Is there any way that I can make the API call only once and then use the result in the parallel filtering step?
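For reference, here is a rough sketch of the kind of thing I'm hoping is possible, where the fetched value is produced once and then passed to the filtering step as a singleton side input. The get_threshold_from_api function is just a placeholder for my real HTTP call, and I don't know whether this is the right or idiomatic way to do it:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def get_threshold_from_api():
    # Placeholder for my single HTTP call; in reality this would parse the
    # response body and return some_other_property.
    raise NotImplementedError


with beam.Pipeline(options=PipelineOptions(), argv=argv) as p:
    # Make the external call exactly once and wrap the result in a
    # single-element PCollection.
    threshold = (
        p
        | 'Trigger single call' >> beam.Create([None])
        | 'Call external service once' >> beam.Map(lambda _: get_threshold_from_api()))

    rows = p | 'Read data' >> beam.io.Read(beam.io.BigQuerySource(
        dataset='test',
        project=PROJECT,
        query='Select * from test.table'
    ))

    # Filter in parallel, reusing the single fetched value as a side input
    # instead of calling the API per element.
    filtered = rows | 'Filter on fetched value' >> beam.Filter(
        lambda element, threshold: element['property'] < threshold,
        threshold=beam.pvalue.AsSingleton(threshold))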