As a newbie, I created a pipeline with a couple of transforms. After reading input from a file, the first transform converts each line to lowercase. When I pass that result to the next stage, it arrives not as a single string but as individual characters. Below is my code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ToLower(beam.DoFn):
    def process(self, element):
        #return [{'Data': element.lower()}]
        return element.lower()


class ToReverse(beam.DoFn):
    def process(self, element):
        print(element)
        return element


if __name__ == '__main__':
    in_file = 'news.txt'
    options = PipelineOptions()

    with beam.Pipeline(options=options) as p:
        r = (
            p | beam.io.ReadFromText(in_file)
            | beam.ParDo(ToLower())
            | beam.ParDo(ToReverse())
        )

Assuming the content of news.txt is below:

Coronavirus cases in Pakistan doubled in one day with total tally at 106 on Monday

When I run the above code it prints the following instead of the reverse:

c
o
r
o
n
a
v
i
r
u
s

c
a
s
e
s

i
n

p
a
k
i
s
t
a
n

d
o
u
b
l
e
d

i
n

o
n
e

d
a
y

w
i
t
h

t
o
t
a
l

t
a
l
l
y

a
t

1
0
6

o
n

m
o
n
d
a
y

When I change the return in ToLower to return [{'Data': element.lower()}], the next stage receives the whole line as a single element. What is going on here?

Volatil3

1 Answer

Based on the Apache Beam documentation (https://beam.apache.org/documentation/transforms/python/elementwise/pardo/), a ParDo emits zero or more output elements for each input element.

What this means is that Beam treats whatever process returns as an iterable of output elements and emits each item of that iterable as a separate element.

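A Python string is itself an iterable: iterating over it yields one single-character string per Unicode code point. So even though lower() returns a single string, Beam iterates over it and emits every character as its own element. Returning [{'Data': element.lower()}] works because the list contains exactly one item, the dict.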
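To make the behaviour concrete, here is a minimal sketch in plain Python, with no Beam import. The helper emit_all is a hypothetical, simplified stand-in for what a runner does with the return value of process; it is not Beam's actual implementation, and the three to_lower_* functions are illustrative names that play the role of ToLower.process.

```python
from typing import Any, Callable, Iterable, List


def emit_all(process: Callable[[str], Iterable[Any]], element: str) -> List[Any]:
    """Hypothetical stand-in for a runner: each item of the iterable
    returned by `process` becomes one output element."""
    return list(process(element))


def to_lower_bare(element: str):
    # Returns a bare str; iterating a str yields one-character strings.
    return element.lower()


def to_lower_list(element: str):
    # Wraps the result in a list, so the iterable holds exactly one item.
    return [element.lower()]


def to_lower_yield(element: str):
    # A generator that yields the whole lowercased string once.
    yield element.lower()


line = "Coronavirus cases"
bare = emit_all(to_lower_bare, line)
wrapped = emit_all(to_lower_list, line)
yielded = emit_all(to_lower_yield, line)

print(bare[:4])   # ['c', 'o', 'r', 'o']
print(wrapped)    # ['coronavirus cases']
print(yielded)    # ['coronavirus cases']
```

In the pipeline from the question, the same idea would mean having ToLower.process either return [element.lower()] or yield element.lower(). For a one-to-one transform like this, beam.Map(lambda line: line.lower()) is another option, since Map emits the function's return value as a single element.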

Hope this clarifies.

A few related SO posts:

Difference between beam.ParDo and beam.Map in the output type?
ParDo vs FlatMap in Apache Beam?

user1502505