
I have a DataFrame, shown below, that contains HDFS file paths. I would like to read those values and then read the contents of each file. What is the best way to do this with parallel processing, without any nested RDDs? I am using Scala 2.11 and Spark 2.1.

+--------------------+
|               value|
+--------------------+
|hdfs://61.81.70.1...|
|hdfs://61.81.70.1...|
|hdfs://61.81.70.1...|
|hdfs://61.81.70.1...|
+--------------------+

Edit based on Ankush's answer: the files are huge and can't be read using wholeTextFiles.
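One way to avoid nested RDDs entirely: the path list itself is tiny even when the files are huge, so it can be collected to the driver and handed to a single textFile call, which accepts a comma-separated list of paths and reads every file line by line in one distributed RDD. A minimal sketch, assuming df is the DataFrame shown above and spark is the active SparkSession:

```scala
// Collect only the path strings to the driver (small, even if the files are huge).
val paths: Array[String] = df.select("value").rdd.map(_.getString(0)).collect()

// textFile accepts a comma-separated list of paths and produces one RDD
// containing every line of every file, processed in parallel.
val lines = spark.sparkContext.textFile(paths.mkString(","))
```

This keeps all file I/O inside a single RDD, so there is no RDD-within-an-RDD problem.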

Thank you

Satheesh

1 Answer


You could use

sc.wholeTextFiles("path/to/all/files")

(doc link for reference)

It generates a pair RDD with key => file path and value => content of that file.
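A short sketch of what that pair RDD looks like in use; the directory path is a placeholder, and each file's entire content arrives as one record:

```scala
// Each element is (filePath, fullFileContent) — one record per file.
val pairs = sc.wholeTextFiles("hdfs://host:port/path/to/dir")

// Example: map each file to its path and content size.
val sizes = pairs.map { case (path, content) => (path, content.length) }
```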

Hope it helps!

Ankush Singh
  • Thank you for the response. I have to read the file contents line by line. From the docs I see the wholeTextFiles method reads the entire content as a single record. The doc also says it gives bad performance for big files – Satheesh Aug 08 '17 at 21:36
  • you could use a map over your dataframe [link](https://stackoverflow.com/questions/37108980/how-to-read-a-file-from-hdfs-in-map-quickly-with-spark ) – Ankush Singh Aug 08 '17 at 22:05
  • I 100% agree with you. But from the docs **Small files are preferred, large file is also allowable, but may cause bad performance.** – Satheesh Aug 08 '17 at 22:13
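The map-over-the-DataFrame approach from the linked comment can be sketched as follows: open each file with the Hadoop FileSystem API inside mapPartitions, so contents are streamed line by line on the executors and no nested RDD is created. This assumes df holds the paths as in the question; it is a sketch, not a hardened implementation:

```scala
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Stream each file's lines from inside the executors — plain iterator
// code, so there is no RDD nested inside another RDD.
val lines = df.select("value").rdd.mapPartitions { iter =>
  iter.flatMap { row =>
    val path   = new Path(row.getString(0))
    val fs     = FileSystem.get(path.toUri, new Configuration())
    val reader = new BufferedReader(new InputStreamReader(fs.open(path)))
    // Read line by line until EOF (readLine returns null at EOF).
    // Note: a production version would also close the reader once
    // the iterator is exhausted.
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }
}
```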