I know that when we implement a ParDo transform, we pick up individual elements from our data (basically separated by "\n"). But what if I have an element that occupies two lines in my file? Can I apply my own condition to decide where each element begins and ends? Or is it always necessary to have an element on a single line?
1 Answer
Reading of text files is controlled by TextIO, not by ParDo - I suppose that's what you meant. Indeed right now TextIO splits files into 1 element per line, however there is work in progress on changing that. You can follow the work at https://issues.apache.org/jira/browse/BEAM-2802.
It would help that work if you described your file format in more detail, to make sure it is in scope.

jkff
- Hi @jkff, I totally forgot about this. Yes, we have a .sql file whose queries naturally span multiple lines. When I tried to read them in my Dataflow program, the queries in the resulting PCollection were not in order, as described in one of my posts that you answered: https://stackoverflow.com/questions/45920895/read-a-file-from-gcs-in-apache-beam. Basically, we were trying to execute all the queries in that file in sequence using Dataflow. – rish0097 Sep 11 '17 at 09:40
- If you're executing queries in sequence, i.e. not in parallel, why do you need Dataflow? :) – jkff Sep 11 '17 at 17:14
- You're right, Dataflow is not required for sequential execution, but we have a batch job that involves a lot of steps, and executing the queries in sequence is one of them, so we had to include it as well. – rish0097 Sep 12 '17 at 05:01
- You can run any custom sequential code within a Beam pipeline, e.g. inside a ProcessElement method. If you just need to do something once, you can use Create.of() to make a single-element collection and apply a ParDo to it. – jkff Sep 13 '17 at 02:59
- Right. Thanks @jkff!! – rish0097 Sep 27 '17 at 04:51
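
For illustration only, here is a minimal plain-Java sketch of the kind of sequential code the last suggestion refers to: taking the full text of a .sql file and splitting it into multi-line statements in file order. It is not a Beam API; the class name SqlStatementSplitter and the sample queries are hypothetical, and it assumes that ';' appears only as a statement terminator (never inside string literals or comments). A helper like this could be called from a single ProcessElement invocation once the whole file has been read as one string.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Apache Beam.
class SqlStatementSplitter {

    // Splits the full contents of a .sql file into individual statements,
    // preserving the order in which they appear in the file.
    // Assumption: ';' is used only to terminate statements.
    static List<String> split(String fileContents) {
        List<String> statements = new ArrayList<>();
        for (String piece : fileContents.split(";")) {
            String statement = piece.trim();
            if (!statement.isEmpty()) {
                statements.add(statement);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // Sample input: the second query spans three lines.
        String contents = "SELECT 1;\nSELECT a,\n  b\nFROM t;\n";
        for (String statement : split(contents)) {
            // A real job would execute each statement here, one at a time.
            System.out.println("---");
            System.out.println(statement);
        }
    }
}
```

Because the loop runs over an ordinary list inside one call, the statements are handled strictly in file order, which is the property the per-line PCollection did not give.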