
I am new to Spark and Scala. I have the following requirement: I need to process all the files under a path that has subdirectories. I guess I need to write for-loop logic to iterate over all the files.

Below is the example of my case:

src/proj_fldr/dataset1/20170624/file1.txt
src/proj_fldr/dataset1/20170624/file2.txt
src/proj_fldr/dataset1/20170624/file3.txt
src/proj_fldr/dataset1/20170625/file1.txt
src/proj_fldr/dataset1/20170625/file2.txt
src/proj_fldr/dataset1/20170625/file3.txt
src/proj_fldr/dataset1/20170626/file1.txt
src/proj_fldr/dataset1/20170626/file2.txt
src/proj_fldr/dataset1/20170626/file3.txt
src/proj_fldr/dataset2/20170624/file1.txt
src/proj_fldr/dataset2/20170624/file2.txt
src/proj_fldr/dataset2/20170624/file3.txt
src/proj_fldr/dataset2/20170625/file1.txt
src/proj_fldr/dataset2/20170625/file2.txt
src/proj_fldr/dataset2/20170625/file3.txt
src/proj_fldr/dataset2/20170626/file1.txt
src/proj_fldr/dataset2/20170626/file2.txt
src/proj_fldr/dataset2/20170626/file3.txt

I need the code to iterate over the files like this, inside src:

   loop (proj_fldr
             loop(dataset
                      loop(datefolder
                                 loop(file1 then, file2....))))

1 Answer


Since you have a regular file structure, you can use the wildcard * when reading the files. You can do the following to read all the files into a single RDD:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.wholeTextFiles("src/*/*/*/*.txt")

The result will be an RDD[(String, String)] containing, for each processed file, a tuple of the file path and its content.
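
If you then need to handle each file individually, you can map over those tuples. Below is a minimal sketch (the path-splitting indices and the line-count aggregation are my own illustration, not part of the original answer) that recovers the dataset and date folder from each path and counts lines per (dataset, date):

// Hypothetical example: derive (dataset, date) from each file path
// and count the lines contributed by each group of files.
val counts = rdd
  .map { case (path, content) =>
    val parts   = path.split("/")            // ..., proj_fldr, dataset1, 20170624, file1.txt
    val dataset = parts(parts.length - 3)    // e.g. dataset1
    val date    = parts(parts.length - 2)    // e.g. 20170624
    ((dataset, date), content.split("\n").length)
  }
  .reduceByKey(_ + _)

counts.collect().foreach(println)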

To state explicitly whether you want to read local or HDFS files, you can prefix the path with "hdfs://" or "file://".
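
For example (the HDFS namenode host/port and the local directory below are placeholders, adjust them to your environment):

// HDFS (namenode host and port are placeholders)
val hdfsRdd  = spark.sparkContext.wholeTextFiles("hdfs://namenode:8020/user/me/src/*/*/*/*.txt")
// Local filesystem (note the triple slash for an absolute local path)
val localRdd = spark.sparkContext.wholeTextFiles("file:///home/me/src/*/*/*/*.txt")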
