Partition rdd from text file while preserving header

Question

For

val rdd = sc.textFile("file.txt")

where file.txt includes

Some Informative Header
value1, value11
value2, value22

how can I partition the RDD into

Some Informative Header
value1, value11

and

Some Informative Header
value2, value22

so that I can run rdd.pipe("/bin/awesomeApp") on each partition?

Note: awesomeApp needs Some Informative Header as its very first entry; the remaining entries may be processed in parallel.

Possible duplicate of [How to skip header from csv files in Spark?](http://stackoverflow.com/questions/27854919/how-to-skip-header-from-csv-files-in-spark) (Although that question doesn't explicitly ask about preserving the header, some of the answers address that detail) — DNA, May 20 '16 at 12:50
If I may ask @DNA which one ? At least provide the OP with the direct link to that answer. — eliasah, May 20 '16 at 12:58
@eliasah Sure - [this answer](http://stackoverflow.com/a/31202898/699224) shows a way to preserve the header — DNA, May 20 '16 at 13:09
I believe the OP is asking about partitioning the main file into smaller files, each with the header. — eliasah, May 20 '16 at 13:10
@echo: is your question on how to divide an RDD into multiple RDDs or repartition the elements of an RDD into different partitions based on content? — vishnu viswanath, May 20 '16 at 16:13
Many thanks for the question. Whichever approach is used, my `awesomeApp` needs `Some Informative Header` as its very first entry. The rest of the contents can be processed in parallel, with no dependence on each other. — echo, May 21 '16 at 11:13

score 0 · Answer 1 · answered May 21 '16 at 11:32

Doing exactly what you describe requires implementing a custom RDD and custom Partitions, which is not the easiest task. If you are flexible about the output format, you can instead convert the input RDD into a key-value RDD in which the key of every row is the header:

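val rdd = sc.textFile("file.txt")
// first() returns the header as a String (take(1) would return an Array)
val header = rdd.first()
// RDDs have no drop(); the header is the first line of partition 0, so skip it there
val lines = rdd.mapPartitionsWithIndex((idx, it) => if (idx == 0) it.drop(1) else it)
// Key every data row by the header
val keyed = lines.map(line => (header, line))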
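
If the only requirement is that each partition handed to `pipe` begins with the header, a possibly lighter route is to prepend the header inside `mapPartitions`. The following is a minimal sketch, not a definitive implementation. It assumes a single input file (so the header is the first line of partition 0) and an external command that reads lines from standard input. The names `HeaderPerPartition`, `withHeader` and `pipeWithHeader` are hypothetical and used only for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HeaderPerPartition {
  // Put the header in front of the rows of one partition.
  def withHeader(header: String, rows: Iterator[String]): Iterator[String] =
    Iterator(header) ++ rows

  // Hypothetical helper: pipes every partition, header first, through `command`
  // and returns the lines the command writes to standard output.
  def pipeWithHeader(sc: SparkContext, path: String, command: String): RDD[String] = {
    val rdd = sc.textFile(path)
    val header = rdd.first()
    // Assumption: one input file, so the header sits at the start of partition 0.
    val lines = rdd.mapPartitionsWithIndex { (idx, it) =>
      if (idx == 0) it.drop(1) else it
    }
    // Each partition now starts with the header before being piped.
    lines.mapPartitions(it => withHeader(header, it)).pipe(command)
  }
}
```

With this sketch, `HeaderPerPartition.pipeWithHeader(sc, "file.txt", "/bin/awesomeApp")` would start one `awesomeApp` process per partition. Note that an empty partition would still send the header alone, so the command should tolerate input that has a header and no data rows.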
