I have a simple question about Spark.

Imagine a file with this data:

00000000000
01000000000
02000000000
00000000000
01000000000
02000000000
03000000000

I want to create an RDD or Spark DataFrame that splits this data at the lines that start with 00. The result would be an RDD of string arrays which, for the sample data above, would look something like this:

[00000000000, 01000000000, 02000000000] // first row
[00000000000, 01000000000, 02000000000, 03000000000] // second row

So the data would be split at each line starting with 00, producing an array of strings that contains that line and all the following lines up to the next line starting with 00, which is where the next row of the RDD should start.
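To make the grouping rule concrete, here is a minimal sketch in plain Python of the per-record logic only. The function name `group_records` and the `marker` parameter are hypothetical names chosen for illustration, and the sketch assumes the lines arrive in file order within a single iterator; it says nothing about how Spark would distribute the file.

```python
from typing import Iterable, Iterator, List


def group_records(lines: Iterable[str], marker: str = "00") -> Iterator[List[str]]:
    """Yield one list of lines per record (hypothetical helper).

    A new record starts at every line that begins with `marker`.
    Lines seen before the first marker line, if any, form a group of their own.
    """
    current: List[str] = []
    for line in lines:
        # A marker line closes the record in progress and opens a new one.
        if line.startswith(marker) and current:
            yield current
            current = []
        current.append(line)
    # Emit the last record, which has no following marker line to close it.
    if current:
        yield current


# The sample data from the question, one string per line of the file.
sample = [
    "00000000000",
    "01000000000",
    "02000000000",
    "00000000000",
    "01000000000",
    "02000000000",
    "03000000000",
]

records = list(group_records(sample))
for record in records:
    print(record)
```

In PySpark, something like `sc.wholeTextFiles(path).flatMap(lambda kv: group_records(kv[1].splitlines()))` might then give an RDD of string lists, under the assumption that each input file is small enough to be held in memory by one executor. Reading the file with `textFile` instead would not be safe for this helper as written, because a record could straddle a partition boundary.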

I would really appreciate some code example for that.

Thank you.

Tzach Zohar

0 Answers