A common problem in Big Data is getting data into a Big Data friendly format (Parquet or TSV).
In Spark, wholeTextFiles, which currently returns RDD[(String, String)] (path -> whole file as a string), is a useful method for this, but it causes many issues when the files are large (mainly memory issues).
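
For example (the paths here are made up), the obvious way to use it pulls each file into a single in-memory String before any per-line work can start:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("whole-text-files-example"))

// RDD[(path, entire file contents as one String)] - each value is fully
// materialised in memory, which is what blows up on large files.
val files = sc.wholeTextFiles("hdfs:///data/raw/*.txt")

// Per-line processing still has to go through the fully loaded String.
val lines = files.flatMap { case (path, contents) =>
  contents.split("\n").iterator.map(line => (path, line))
}
lines.saveAsTextFile("hdfs:///data/out")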
In principle it ought to be possible to write a method as follows using the underlying Hadoop API:

def wholeTextFilesIterators(path: String): RDD[(String, Iterator[String])]

where the iterator yields the lines of the file (assuming newline as the delimiter) and encapsulates the underlying file reading and buffering.
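
Something close to this can already be approximated with sc.binaryFiles (Spark 1.2+), which hands back a lazy PortableDataStream per file rather than the whole contents. A minimal sketch (passing the SparkContext explicitly and assuming UTF-8 are my own adaptations, and the stream behind the iterator is never explicitly closed):

import scala.io.Source

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def wholeTextFilesIterators(sc: SparkContext, path: String): RDD[(String, Iterator[String])] =
  sc.binaryFiles(path).map { case (file, stream) =>
    // The stream is only opened inside the task, and lines are read lazily
    // through the usual buffering, so the whole file is never materialised.
    (file, Source.fromInputStream(stream.open(), "UTF-8").getLines())
  }

This only behaves well as long as each iterator is created and consumed within a single task, which is exactly the limitation discussed below.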
After reading through the code for a while, I think a solution would involve creating something similar to WholeTextFileInputFormat and WholeTextFileRecordReader.
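
For reference, here is a very rough skeleton of what the RecordReader side might look like, modelled loosely on Spark's WholeTextFileRecordReader (the class name is mine, and this is a sketch rather than a working implementation). It emits one (path, line-iterator) record per file and deliberately ignores the serialisation problem the update below is about:

import scala.io.Source

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit

// Hypothetical reader: one (path, lazy line iterator) record per file.
class IteratorFileRecordReader(
    split: CombineFileSplit,
    context: TaskAttemptContext,
    index: Integer)
  extends RecordReader[Text, Iterator[String]] {

  private val path: Path = split.getPath(index)
  private var value: Iterator[String] = Iterator.empty
  private var processed = false

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = ()

  override def nextKeyValue(): Boolean = {
    if (processed) {
      false
    } else {
      val fs = path.getFileSystem(context.getConfiguration)
      // The file handle stays open behind the iterator; lines are read lazily.
      value = Source.fromInputStream(fs.open(path)).getLines()
      processed = true
      true
    }
  }

  override def getCurrentKey: Text = new Text(path.toString)
  override def getCurrentValue: Iterator[String] = value
  override def getProgress: Float = if (processed) 1.0f else 0.0f
  override def close(): Unit = ()
}

Plugging this in would also need a CombineFileInputFormat subclass along the lines of WholeTextFileInputFormat, and the iterator value type is where the serialisation question bites.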
UPDATE:
After some thought, this probably means also implementing a custom org.apache.hadoop.io.BinaryComparable so the iterator can survive a shuffle (it is hard to serialise the iterator since it holds a file handle).
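
To make the shuffle problem concrete, this is roughly what I would expect with the binaryFiles-based sketch above (paths made up): map-side consumption is fine, but anything that pushes the iterator values through the shuffle serialiser should fail at runtime, because the iterator wraps an open, non-serialisable stream.

val rdd = wholeTextFilesIterators(sc, "hdfs:///data/raw/*.txt")

// Fine: each iterator is created and drained inside a single task.
val lineCounts = rdd.map { case (path, lines) => (path, lines.size) }
lineCounts.collect()

// Expected to fail (e.g. java.io.NotSerializableException), since the
// iterator values must be serialised for the shuffle write.
rdd.repartition(100).count()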