Spark RDD based on multiple lines of a file

Asked Apr 13 '18 at 19:56

Active Apr 14 '18 at 05:54


I have a simple question about Spark.

Imagine a file with this data:

00000000000
01000000000
02000000000
00000000000
01000000000
02000000000
03000000000

I want to create an RDD or Spark DataFrame that splits this data at the lines that start with 00. The result would be an RDD of string arrays which, for this example data, would look something like this:

[00000000000, 01000000000, 02000000000] // first row
[00000000000, 01000000000, 02000000000, 03000000000] // second row

So it would split the data at each line starting with 00 and create an array of strings containing that line and all the lines after it, up to the next line starting with 00, where the next row of the RDD should start.

I would really appreciate a code example for that.

Thank you.
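
One possible approach, offered as a minimal sketch rather than a definitive solution: walk the lines in order and start a new group whenever a line begins with 00. The names `group_records`, `records_rdd`, and `marker` are hypothetical and chosen only for illustration. The Spark helper assumes an existing `SparkContext` passed in as `sc`, and assumes the file is small enough to be processed in a single partition, because a group could otherwise be cut in two at a partition boundary.

```python
from typing import Iterable, Iterator, List


def group_records(lines: Iterable[str], marker: str = "00") -> Iterator[List[str]]:
    """Yield lists of consecutive lines, starting a new list at each marker line.

    Lines that appear before the first marker line end up in the first group.
    """
    current: List[str] = []
    for line in lines:
        # A marker line closes the group collected so far (if there is one).
        if line.startswith(marker) and current:
            yield current
            current = []
        current.append(line)
    # Emit the last group, which has no following marker line to close it.
    if current:
        yield current


def records_rdd(sc, path: str):
    """Hypothetical helper: build an RDD with one list of lines per record.

    Assumption: `sc` is an existing SparkContext. coalesce(1) pulls every
    line into one partition so that no record is split across partitions;
    this gives up parallelism and only suits files that fit in one task.
    """
    return sc.textFile(path).coalesce(1).mapPartitions(group_records)
```

With the sample data from the question, `list(group_records(lines))` gives two lists, one with three lines and one with four. For files too large for a single partition, another option to look into is setting the Hadoop `textinputformat.record.delimiter` property so that Spark itself reads each multi-line block as one record.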

edited Apr 13 '18 at 20:01

Tzach Zohar


asked Apr 13 '18 at 19:56

Pedro Kássio

