
I have n files in a directory, all with the same .txt extension, and I want to load them in a loop and then create a separate dataframe for each of them.

I have read this, but in my case all my files have the same extension and I want to iterate over them one by one and make a dataframe for every file.

I started by counting the files in the directory with the following line of code:

sc.wholeTextFiles("/path/to/dir/*.txt").count()

but I don't know how I should proceed further. Please guide me.

I am using Spark 2.3 and Scala.

Thanks.

M.Ahsen Taqi
Why do you want a dataframe for each file? It makes little sense in Spark. Would it not be better to have a single dataframe where each row keeps track of the document it comes from? – Álvaro Valencia Aug 06 '18 at 17:56

2 Answers


wholeTextFiles returns a paired RDD:

def wholeTextFiles(path: String, minPartitions: Int): rdd.RDD[(String, String)]

You can map over the RDD; the key of each pair is the path of the file and the value is the content of the file:

sc.wholeTextFiles("/path/to/dir/*.txt").take(2)

sc.wholeTextFiles("/path/to/dir/*.txt").map((x,y)=> some logic on x and y )
loneStar

You could use the Hadoop FileSystem API to get the list of files under the directory, then iterate over them and load each one into a different dataframe.

Something like the below:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop FS
val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)

// Get the list of files under the directory
val fs_status = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (fs_status.hasNext) {

      val fileStatus = fs_status.next.getPath
      val filepath = fileStatus.toString
      // Read each file into its own DataFrame of lines
      val df = spark.read.text(filepath)
}
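
Note that in the loop above each df is overwritten on the next iteration. A small, hypothetical variation (the helper name loadPerFileDataFrames is just illustrative, not from the original answer) that keeps all of them in a Map keyed by file name could look like this:

import scala.collection.mutable
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: one dataframe per .txt file under dirPath
def loadPerFileDataFrames(spark: SparkSession, dirPath: String): Map[String, DataFrame] = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val files = fs.listLocatedStatus(new Path(dirPath))
  val dfs = mutable.Map.empty[String, DataFrame]
  while (files.hasNext) {
    val p = files.next.getPath
    if (p.getName.endsWith(".txt")) {
      dfs(p.getName) = spark.read.text(p.toString)   // one dataframe of lines per file
    }
  }
  dfs.toMap
}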
K S Nidhin