Pick elements in processElement() - Apache Beam

Question

I know that when we implement a ParDo transform, it processes individual elements of our data (basically, lines separated by "\n"). But what if I have an element that occupies two lines in my file? Can I apply my own condition to decide where an element begins and ends, or does each element always have to fit on a single line?

score 1 · Answer 1 · answered Aug 29 '17 at 15:35

Reading of text files is controlled by TextIO, not by ParDo - I suppose that's what you meant. Right now TextIO does split files into one element per line, but there is work in progress on changing that. You can follow it at https://issues.apache.org/jira/browse/BEAM-2802.

It would help that work if you described your file format in more detail, so we can make sure it is in scope.
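To illustrate what "your own condition" for element boundaries could look like: if the whole file content is available as one string (for example, read inside a single processElement call), splitting it on a custom delimiter is plain Java. The sketch below is only an illustration, not part of Beam. The class name `SqlStatementSplitter` and the method `split` are hypothetical, and it assumes statements end with `;` and that no `;` appears inside string literals or comments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a Beam API: turns the full text of a file into
// multi-line elements, using ';' as the element boundary instead of '\n'.
final class SqlStatementSplitter {

    // Returns the statements in file order, trimmed, with empty pieces dropped.
    // Assumption: ';' never occurs inside a string literal or a comment.
    static List<String> split(String fileContents) {
        List<String> statements = new ArrayList<>();
        for (String part : fileContents.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }
}
```

Calling `SqlStatementSplitter.split(contents)` on a file with two multi-line queries yields a two-element list in file order, each element keeping its internal line breaks.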

answered Aug 29 '17 at 15:35

jkff

Hi @jkff, I totally forgot about this. Yes, we have a .sql file that naturally has queries spanning multiple lines. When I tried to read them in my Dataflow program, the queries in the resulting PCollection were not in order, as I described in one of my posts that you answered - https://stackoverflow.com/questions/45920895/read-a-file-from-gcs-in-apache-beam. So basically we were trying to execute all the queries in that file in sequence using Dataflow. – rish0097 Sep 11 '17 at 09:40
If you're executing queries in sequence, i.e. not in parallel, why do you need Dataflow? :) – jkff Sep 11 '17 at 17:14
You're right, Dataflow is not required for sequential execution, but we have a batch job that involves a lot of steps, and executing the queries in sequence is one of those steps. So we had to include it as well. – rish0097 Sep 12 '17 at 05:01
You can run any custom sequential code within a Beam pipeline, e.g. inside a ProcessElement method. If you just need to do something once, you can use Create.of() to make a single-element collection and apply a ParDo to it. – jkff Sep 13 '17 at 02:59
Right. Thanks @jkff!! – rish0097 Sep 27 '17 at 04:51