
I have a DataFrame, shown below, that contains HDFS file paths. I would like to read those values and then read the contents of each file. What is the best way to do this with parallel processing, without any nested RDDs? I am using Scala 2.11 and Spark 2.1.

+--------------------+
|               value|
+--------------------+
|hdfs://61.81.70.1...|
|hdfs://61.81.70.1...|
|hdfs://61.81.70.1...|
|hdfs://61.81.70.1...|
+--------------------+

Edit based on Ankush's answer: the files are huge and can't be read using wholeTextFiles.
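One way to avoid nested RDDs entirely: the path list itself is tiny even when the files are huge, so it can be collected to the driver and handed to a single textFile call, which accepts a comma-separated list of paths and reads every file line by line in one distributed RDD. A minimal sketch, assuming df is the DataFrame shown above and spark is the active SparkSession:

```scala
// Collect only the path strings to the driver (small, even if the files are huge).
val paths: Array[String] = df.select("value").rdd.map(_.getString(0)).collect()

// textFile accepts a comma-separated list of paths and produces one RDD
// containing every line of every file, processed in parallel.
val lines = spark.sparkContext.textFile(paths.mkString(","))
```

This keeps all file I/O inside a single RDD, so there is no RDD-within-an-RDD problem.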

Thank you

Satheesh

1 Answer


You could use

sc.wholeTextFiles("path/to/all/files")

(doc link for reference)

It generates a pair RDD with key => file path and value => content of that file.
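A short sketch of what that pair RDD looks like in use; the directory path is a placeholder, and each file's entire content arrives as one record:

```scala
// Each element is (filePath, fullFileContent) — one record per file.
val pairs = sc.wholeTextFiles("hdfs://host:port/path/to/dir")

// Example: map each file to its path and content size.
val sizes = pairs.map { case (path, content) => (path, content.length) }
```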

Hope it helps!

Ankush Singh
  • Thank you for the response. I have to read the file contents line by line. From the docs I see the wholeTextFiles method reads the entire content as a single record. The doc also says it gives bad performance for big files – Satheesh Aug 08 '17 at 21:36
  • you could use a map over your dataframe [link](https://stackoverflow.com/questions/37108980/how-to-read-a-file-from-hdfs-in-map-quickly-with-spark ) – Ankush Singh Aug 08 '17 at 22:05
  • I 100% agree with you. But from the docs **Small files are preferred, large file is also allowable, but may cause bad performance.** – Satheesh Aug 08 '17 at 22:13
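The map-over-the-DataFrame approach from the linked comment can be sketched as follows: open each file with the Hadoop FileSystem API inside mapPartitions, so contents are streamed line by line on the executors and no nested RDD is created. This assumes df holds the paths as in the question; it is a sketch, not a hardened implementation:

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Stream each file's lines from inside the executors — plain iterator
// code, so there is no RDD nested inside another RDD.
val lines = df.select("value").rdd.mapPartitions { iter =>
  iter.flatMap { row =>
    val path   = new Path(row.getString(0))
    val fs     = FileSystem.get(path.toUri, new Configuration())
    val reader = new BufferedReader(new InputStreamReader(fs.open(path)))
    // Read line by line until EOF (readLine returns null at EOF).
    // Note: a production version would also close the reader once
    // the iterator is exhausted.
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }
}
```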