I am trying to load a directory of files into a Spark RDD, and I need to append the origin file name to each line.
I was not able to find a way to do this with a regular RDD using sc.textFile, so I am now attempting to use the wholeTextFiles method, which loads each file together with its file name.
I am using this code:
val lines = sc.wholeTextFiles(logDir).flatMap { case (filename, content) =>
  // Hash the file name once per file, then tag every surviving line with it.
  val hash = modFiles.md5(filename)
  content.split("\n")
    .filter(line => <filter conditions>)
    .map(line => line + hash)
}
However, this code gives me a Java heap space out-of-memory error. I guess it is trying to load all of the files into memory at once?
Is there a way to solve this without using wholeTextFiles, and/or is there a way to avoid loading all of the files into memory at once when using wholeTextFiles?
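For reference, here is a rough sketch of the per-file alternative I have been considering. It assumes logDir is a directory on a Hadoop-compatible filesystem, uses a stand-in md5 helper in place of my modFiles.md5, and replaces my <filter conditions> placeholder with a pass-through; I am not sure whether a union over many per-file RDDs is idiomatic, which is part of what I am asking:

import java.security.MessageDigest
import org.apache.hadoop.fs.{FileSystem, Path}

// Stand-in for my modFiles.md5 helper, so the sketch is self-contained.
def md5(s: String): String =
  MessageDigest.getInstance("MD5")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// List the files myself, then load each one with sc.textFile so Spark can
// stream it line by line instead of materialising whole files in memory.
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listStatus(new Path(logDir)).map(_.getPath.toString)

val lines = sc.union(files.toSeq.map { filename =>
  val hash = md5(filename)
  sc.textFile(filename)
    .filter(line => true /* my real <filter conditions> would go here */)
    .map(line => line + hash)
})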