Apache Beam Map, DoFn and Composite Transform

Question

I want to understand the difference in use cases between a Map function, a DoFn called from ParDo, and a composite transform.

I could achieve the same results with the below code for a list of transformations that I need to do for my pipeline. I made a sample of what I mean by multiple stages.

import apache_beam as beam

def myTransform(line):
    line = line * 10
    line = line + 5
    line = line - 2
    return line

class myPTransform(beam.PTransform):
    def expand(self, pcoll):
        # return pcoll | beam.Map(myTransform)
        pcol_output = (pcoll
                       | beam.Map(lambda line: line * 10)
                       | beam.Map(lambda line: line + 5)
                       | beam.Map(lambda line: line - 2)
        )
        return pcol_output

class mydofunc(beam.DoFn):
    def process(self, element):
        element = element * 10
        element = element + 5
        element = element - 2
        yield element

with beam.Pipeline() as p:
    lines = p | beam.Create([1,2,3,4,5])
    
    ### Map Function
    manual = (lines
              | "Map function" >> beam.Map(myTransform)
              | "Print map" >> beam.Map(print))
    
    ### Composite Ptransform
    ptrans = (lines
              | "ptransform call" >> myPTransform()
              | "Print ptransform" >> beam.Map(print))
    
    ### Do Function
    dofnpcol = (lines
              | "Dofn call" >> beam.ParDo(mydofunc())
              | "Print dofnpcol" >> beam.Map(print))

In what scenarios should I use a DoFn, and in what scenarios a composite transform? I might be missing the bigger picture on the difference between these 3 options. Any insights would be really helpful.

I saw a related question: "Apache Beam: DoFn vs PTransform".

score 3 · Accepted Answer · answered Feb 23 '23 at 10:38

In this first operation, the whole transformation runs as a single step inside one worker:

   ### Map Function
    manual = (lines
              | "Map function" >> beam.Map(myTransform)
              | "Print map" >> beam.Map(print))

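In this second operation, the 3 transformations are separate steps in the pipeline graph, so they can be done on multiple workers (although many runners fuse consecutive element-wise steps into a single stage):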

    ### Composite Ptransform
    ptrans = (lines
              | "ptransform call" >> myPTransform()
              | "Print ptransform" >> beam.Map(print))

In this last operation, the result is exactly the same as in the first example with Map. Behind the scenes, the Map operator uses a DoFn, but lets you pass a function (or lambda) directly as a parameter. This is useful for removing boilerplate code when the transformation simply maps one element to another element:

    
    ### Do Function
    dofnpcol = (lines
              | "Dofn call" >> beam.ParDo(mydofunc())
              | "Print dofnpcol" >> beam.Map(print))

The choice depends on several criteria. In your example the transformation isn't expensive, so you can use the first or the third approach.

But in a real-world application, with large volumes and more complex operations, you can choose the second approach.

A composite PTransform lets you assemble several transformations together, and this kind of class and design makes them reusable in other places of the pipeline or in other pipelines.

Thank you @Mazlum. If the volume is too high what is the best option between a Composite Transform and using a DoFn? — Ashok KS, Feb 24 '23 at 01:47
A composite transform is mostly used to isolate a responsibility in its own place. For a usual transformation, you can use a DoFn or Map. If you have 2 immutable transformations, you can use 2 DoFns or Maps. If these 2 transformations can be reused together, or if you want to separate this responsibility, it's better to create a composite transform. For one Map versus several Maps, it depends on whether the transformation is heavy. Most of the time we compose multiple immutable transformations. — Mazlum Tosun, Feb 24 '23 at 08:53
