I've got a PCollection where each element is a key/values tuple like this: (key, (value1, ..., value_n))
I need to split this PCollection into two processing branches.
As always, I need the whole pipeline to be as fast as possible and to use as little RAM as possible.
Two ideas come to my mind:
Option 1: Split the PColl with a DoFn with multiple outputs
import apache_beam as beam

class SplitInTwo(beam.DoFn):
    def process(self, kvpair):
        key, values = kvpair
        yield beam.pvalue.TaggedOutput('left', (key, values[0:2]))
        yield beam.pvalue.TaggedOutput('right', (key, values[2:]))

class ProcessLeft(beam.DoFn):
    def process(self, kvpair):
        key, values = kvpair
        ...
        yield (key, results)

# class ProcessRight is similar to ProcessLeft
And then build the pipeline like this:

splitme = pcoll | beam.ParDo(SplitInTwo()).with_outputs('left', 'right')
left = splitme.left | beam.ParDo(ProcessLeft())
right = splitme.right | beam.ParDo(ProcessRight())
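To make the element flow of Option 1 concrete, here is a plain-Python sketch (no Beam involved) of what SplitInTwo emits for a single element; the sample key and the four-value tuple are made up for illustration:

```python
# Plain-Python illustration of Option 1's split. The element shape and the
# 0:2 / 2: slice points mirror the SplitInTwo DoFn above; the tag strings
# stand in for Beam's TaggedOutput wrapping.

def split_in_two(kvpair):
    key, values = kvpair
    return [('left', (key, values[0:2])),
            ('right', (key, values[2:]))]

element = ('k', (1, 2, 3, 4))
print(split_in_two(element))
# [('left', ('k', (1, 2))), ('right', ('k', (3, 4)))]
```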
Option 2: Use two different DoFn on the original PCollection
Another option is using two DoFns to read and process the same PCollection. Just using one for the 'left' and 'right' hand sides of the data:
class ProcessLeft(beam.DoFn):
    def process(self, kvpair):
        key = kvpair[0]
        values = kvpair[1][0:2]
        ...
        yield (key, result)

# class ProcessRight is similar to ProcessLeft
Building the pipeline is simpler... (plus you don't need to track which tagged outputs you have):
left = pcoll | beam.ParDo(ProcessLeft())
right = pcoll | beam.ParDo(ProcessRight())
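For comparison, here is the same element pushed through a plain-Python sketch of Option 2, where each branch reads the whole element and slices out only the values it needs. The sum() body is a made-up placeholder for the real per-branch work elided above:

```python
# Plain-Python illustration of Option 2: both "DoFns" receive the full
# (key, values) element and slice locally. sum() is a placeholder.

def process_left(kvpair):
    key, values = kvpair[0], kvpair[1][0:2]
    return (key, sum(values))

def process_right(kvpair):
    key, values = kvpair[0], kvpair[1][2:]
    return (key, sum(values))

element = ('k', (1, 2, 3, 4))
print(process_left(element), process_right(element))
# ('k', 3) ('k', 7)
```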
But... is it faster? Will it need less memory than the first one?
(I'm wondering whether the first option might be fused by the runner, and not just by the Dataflow runner.)