
A common problem in Big Data is getting data into a Big-Data-friendly format (Parquet or TSV).

In Spark, wholeTextFiles, which currently returns RDD[(String, String)] (path -> whole file as a string), is a useful method for this, but it causes many problems when the files are large (mainly memory issues, since each file must fit in memory as a single string).
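For reference, the current API looks like this (assuming a SparkContext sc, e.g. in spark-shell; the path is illustrative):

    import org.apache.spark.rdd.RDD

    // each element is (path, entire file contents as one String),
    // so every file must fit in memory on a single executor
    val files: RDD[(String, String)] = sc.wholeTextFiles("hdfs:///data/*.txt")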

In principle, it ought to be possible to write a method as follows using the underlying Hadoop API:

def wholeTextFilesIterators(path: String): RDD[(String, Iterator[String])]

where the iterator yields the lines of the file (assuming newline as the delimiter) and encapsulates the underlying file reading and buffering.

After reading through the code for a while, I think a solution would involve creating something similar to WholeTextFileInputFormat and WholeTextFileRecordReader.
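In the meantime, something close can be approximated with sc.binaryFiles, which returns an RDD[(String, PortableDataStream)] and only opens each file lazily on the executor. A minimal sketch, assuming a SparkContext sc (the helper name wholeTextFilesIterators is mine, not part of Spark):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.io.Source

    // Hypothetical helper: wraps each file's lazy stream in a line iterator
    // instead of materialising the whole file as a String.
    def wholeTextFilesIterators(sc: SparkContext, path: String): RDD[(String, Iterator[String])] =
      sc.binaryFiles(path).map { case (file, stream) =>
        // stream.open() runs on the executor; getLines() reads and buffers lazily
        (file, Source.fromInputStream(stream.open()).getLines())
      }

Note this has the same limitation described in the update below: the iterator holds an open file handle, so it cannot be serialised and will not survive a shuffle.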

UPDATE:

After some thought, this probably also means implementing a custom org.apache.hadoop.io.BinaryComparable so the record can survive a shuffle (the iterator is hard to serialise as it holds a file handle).

See also https://issues.apache.org/jira/browse/SPARK-22225 and the related question "Spark - Obtaining file name in RDDs".


1 Answer


As per Hyukjin's comment on the JIRA, something close to what is wanted is given by

spark.read.format("text").load("...").selectExpr("value", "input_file_name()")
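For completeness, the same thing via the typed functions API (the output column name file is my choice):

    import org.apache.spark.sql.functions.{col, input_file_name}

    // one row per line of text, tagged with the fully qualified path of its source file
    val df = spark.read.format("text").load("...")
      .select(col("value"), input_file_name().as("file"))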