
A common problem in Big Data is getting data into a Big-Data-friendly format (Parquet or TSV).

In Spark, wholeTextFiles, which currently returns RDD[(String, String)] (path -> whole file as a string), is a useful method for this, but it causes many problems when the files are large (mainly memory issues, since each file must fit in memory as a single string).
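For reference, the current API looks like this (assuming a SparkContext sc, e.g. in spark-shell; the path is illustrative):

    import org.apache.spark.rdd.RDD

    // each element is (path, entire file contents as one String),
    // so every file must fit in memory on a single executor
    val files: RDD[(String, String)] = sc.wholeTextFiles("hdfs:///data/*.txt")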

In principle, it ought to be possible to write a method as follows using the underlying Hadoop API:

def wholeTextFilesIterators(path: String): RDD[(String, Iterator[String])]

where the iterator yields the lines of the file (assuming newline as the delimiter) and encapsulates the underlying file reading and buffering.

After reading through the code for a while, I think a solution would involve creating something similar to WholeTextFileInputFormat and WholeTextFileRecordReader.
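In the meantime, something close can be approximated with sc.binaryFiles, which returns an RDD[(String, PortableDataStream)] and only opens each file lazily on the executor. A minimal sketch, assuming a SparkContext sc (the helper name wholeTextFilesIterators is mine, not part of Spark):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.io.Source

    // Hypothetical helper: wraps each file's lazy stream in a line iterator
    // instead of materialising the whole file as a String.
    def wholeTextFilesIterators(sc: SparkContext, path: String): RDD[(String, Iterator[String])] =
      sc.binaryFiles(path).map { case (file, stream) =>
        // stream.open() runs on the executor; getLines() reads and buffers lazily
        (file, Source.fromInputStream(stream.open()).getLines())
      }

Note this has the same limitation described in the update below: the iterator holds an open file handle, so it cannot be serialised and will not survive a shuffle.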

UPDATE:

After some thought, this probably also means implementing a custom org.apache.hadoop.io.BinaryComparable so the record can survive a shuffle (the iterator is hard to serialise as it holds a file handle).

See also https://issues.apache.org/jira/browse/SPARK-22225 and the related question "Spark - Obtaining file name in RDDs".


1 Answer


As per Hyukjin's comment on the JIRA, something close to what is wanted is given by

spark.read.format("text").load("...").selectExpr("value", "input_file_name()")
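For completeness, the same thing via the typed functions API (the output column name file is my choice):

    import org.apache.spark.sql.functions.{col, input_file_name}

    // one row per line of text, tagged with the fully qualified path of its source file
    val df = spark.read.format("text").load("...")
      .select(col("value"), input_file_name().as("file"))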