I want to understand the difference in use cases between a Map function, a DoFn called from ParDo, and a composite transform.
I can achieve the same results with each of the approaches in the code below for the list of transformations my pipeline needs. I made a sample of what I mean by multiple stages.
import apache_beam as beam

# Plain function applied element-wise via beam.Map
def myTransform(line):
    line = line * 10
    line = line + 5
    line = line - 2
    return line

# Composite transform: chains three Map steps inside expand()
class myPTransform(beam.PTransform):
    def expand(self, pcoll):
        # return pcoll | beam.Map(myTransform)
        pcol_output = (pcoll
                       | beam.Map(lambda line: line * 10)
                       | beam.Map(lambda line: line + 5)
                       | beam.Map(lambda line: line - 2)
                       )
        return pcol_output

# DoFn: same arithmetic, run through beam.ParDo
class mydofunc(beam.DoFn):
    def process(self, element):
        element = element * 10
        element = element + 5
        element = element - 2
        yield element

with beam.Pipeline() as p:
    lines = p | beam.Create([1, 2, 3, 4, 5])

    ### Map Function
    manual = (lines
              | "Map function" >> beam.Map(myTransform)
              | "Print map" >> beam.Map(print))

    ### Composite PTransform
    ptrans = (lines
              | "ptransform call" >> myPTransform()
              | "Print ptransform" >> beam.Map(print))

    ### Do Function
    dofnpcol = (lines
                | "Dofn call" >> beam.ParDo(mydofunc())
                | "Print dofnpcol" >> beam.Map(print))
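
For reference, all three variants apply the same per-element arithmetic (multiply by 10, add 5, subtract 2), so each branch should print the same set of values for the sample input. The following is only a plain-Python sketch of that arithmetic, without Beam; the names `inputs` and `expected` are mine, added for illustration. Beam does not guarantee element order within a PCollection, so the printed order in the real pipeline may differ.

```python
# Plain-Python sketch (no Beam) of the values every branch should produce.
# `inputs` and `expected` are illustrative names, not part of the pipeline.

def myTransform(line):
    line = line * 10
    line = line + 5
    line = line - 2
    return line

inputs = [1, 2, 3, 4, 5]

# Apply the same three steps to each sample element.
expected = [myTransform(x) for x in inputs]

print(expected)  # [13, 23, 33, 43, 53]
```

So the question is not about the output, which is identical, but about when each of the three shapes is the appropriate one.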
In what scenarios should I use a DoFn, and in what scenarios a composite transform? I might be missing the bigger picture about the differences between these three options. Any insights would be really helpful.
I saw a related question, "Apache Beam: DoFn vs PTransform".