
I have n files in a directory, all with the same .txt extension, and I want to load them in a loop and then create a separate dataframe for each of them.

I have read this, but in my case all my files have the same extension and I want to iterate over them one by one and make a dataframe for every file.

I started by counting the files in the directory with the following line of code:

sc.wholeTextFiles("/path/to/dir/*.txt").count()

but I don't know how I should proceed further. Please guide me.

I am using Spark 2.3 and Scala.

Thanks.

M.Ahsen Taqi
Why do you want a dataframe for each file? It makes little sense in Spark. Would it not be better to have a single dataframe where each row keeps track of the document it comes from? – Álvaro Valencia Aug 06 '18 at 17:56

2 Answers


wholeTextFiles returns a paired RDD:

def wholeTextFiles(path: String, minPartitions: Int): rdd.RDD[(String, String)]

You can map over the RDD; the key of each pair is the path of the file and the value is the content of the file:

sc.wholeTextFiles("/path/to/dir/*.txt").take(2)

sc.wholeTextFiles("/path/to/dir/*.txt").map((x,y)=> some logic on x and y )
loneStar

You could use the Hadoop FileSystem API to get the list of files under the directory, then iterate over them and load each one into a different dataframe.

Something like the below:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hadoop FS
val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)

// Get the list of files under the directory
val fs_status = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (fs_status.hasNext) {

      val fileStatus = fs_status.next.getPath
      val filepath = fileStatus.toString
      // Read each file into its own DataFrame of lines
      val df = spark.read.text(filepath)
}
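
Note that in the loop above each df is overwritten on the next iteration. A small, hypothetical variation (the helper name loadPerFileDataFrames is just illustrative, not from the original answer) that keeps all of them in a Map keyed by file name could look like this:

import scala.collection.mutable
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: one dataframe per .txt file under dirPath
def loadPerFileDataFrames(spark: SparkSession, dirPath: String): Map[String, DataFrame] = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val files = fs.listLocatedStatus(new Path(dirPath))
  val dfs = mutable.Map.empty[String, DataFrame]
  while (files.hasNext) {
    val p = files.next.getPath
    if (p.getName.endsWith(".txt")) {
      dfs(p.getName) = spark.read.text(p.toString)   // one dataframe of lines per file
    }
  }
  dfs.toMap
}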
K S Nidhin