
I am new to Spark and Scala. I have the following requirement: I need to process all the files under a path that has subdirectories. I guess I need to write for-loop logic to iterate over all the files.

Below is the example of my case:

src/proj_fldr/dataset1/20170624/file1.txt
src/proj_fldr/dataset1/20170624/file2.txt
src/proj_fldr/dataset1/20170624/file3.txt
src/proj_fldr/dataset1/20170625/file1.txt
src/proj_fldr/dataset1/20170625/file2.txt
src/proj_fldr/dataset1/20170625/file3.txt
src/proj_fldr/dataset1/20170626/file1.txt
src/proj_fldr/dataset1/20170626/file2.txt
src/proj_fldr/dataset1/20170626/file3.txt
src/proj_fldr/dataset2/20170624/file1.txt
src/proj_fldr/dataset2/20170624/file2.txt
src/proj_fldr/dataset2/20170624/file3.txt
src/proj_fldr/dataset2/20170625/file1.txt
src/proj_fldr/dataset2/20170625/file2.txt
src/proj_fldr/dataset2/20170625/file3.txt
src/proj_fldr/dataset2/20170626/file1.txt
src/proj_fldr/dataset2/20170626/file2.txt
src/proj_fldr/dataset2/20170626/file3.txt

I need the code to iterate over the files like this, inside src:

   loop (proj_fldr
             loop(dataset
                      loop(datefolder
                                 loop(file1 then, file2....))))

1 Answer


Since you have a regular file structure, you can use the wildcard * when reading the files. You can do the following to read all the files into a single RDD:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
val rdd = spark.sparkContext.wholeTextFiles("src/*/*/*/*.txt")

The result will be an RDD[(String, String)] containing, for each processed file, a tuple of the file path and its content.
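
If you then need to handle each file individually, you can map over those tuples. Below is a minimal sketch (the path-splitting indices and the line-count aggregation are my own illustration, not part of the original answer) that recovers the dataset and date folder from each path and counts lines per (dataset, date):

// Hypothetical example: derive (dataset, date) from each file path
// and count the lines contributed by each group of files.
val counts = rdd
  .map { case (path, content) =>
    val parts   = path.split("/")            // ..., proj_fldr, dataset1, 20170624, file1.txt
    val dataset = parts(parts.length - 3)    // e.g. dataset1
    val date    = parts(parts.length - 2)    // e.g. 20170624
    ((dataset, date), content.split("\n").length)
  }
  .reduceByKey(_ + _)

counts.collect().foreach(println)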

To state explicitly whether you want to read local or HDFS files, you can prefix the path with "hdfs://" or "file://".
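
For example (the HDFS namenode host/port and the local directory below are placeholders, adjust them to your environment):

// HDFS (namenode host and port are placeholders)
val hdfsRdd  = spark.sparkContext.wholeTextFiles("hdfs://namenode:8020/user/me/src/*/*/*/*.txt")
// Local filesystem (note the triple slash for an absolute local path)
val localRdd = spark.sparkContext.wholeTextFiles("file:///home/me/src/*/*/*/*.txt")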
