
I want to recursively read all csv files in a given folder into a Spark SQL DataFrame using a single path, if possible.

My folder structure looks something like this and I want to include all of the files with one path:

  1. resources/first.csv
  2. resources/subfolder/second.csv
  3. resources/subfolder/third.csv

This is my code:

def read: DataFrame =
      sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("charset", "UTF-8")
        .csv(path)

Setting path to `.../resources/*/*.csv` omits 1., while `.../resources/*.csv` omits 2. and 3.

I know csv() also takes multiple strings as path arguments, but want to avoid that, if possible.
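
For reference, the multi-path call I would like to avoid would look roughly like this (using the same reader and options as above):

      sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .option("charset", "UTF-8")
        .csv("resources/first.csv", "resources/subfolder/second.csv", "resources/subfolder/third.csv")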

Note: I know my question is similar to How to import multiple csv files in a single load?, except that I want to include the files from all contained folders, independent of their location within the main folder.

NotYanka

2 Answers


If there are only csv files and only one level of subfolders in your resources directory, then you can use `resources/**`.
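
With the reader from the question, that would look something like this (a sketch assuming only csv files live under `resources/`):

      sparkSession
        .read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("resources/**")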

EDIT

Otherwise, you can use the Hadoop `FileSystem` class to recursively list every csv file in your resources directory and then pass the list to `.csv()`:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ListBuffer

    val fs = FileSystem.get(new Configuration())
    // the second argument tells listFiles to recurse into subdirectories
    val files = fs.listFiles(new Path("resources/"), true)
    val filePaths = new ListBuffer[String]
    while (files.hasNext()) {
        val file = files.next()
        // keep only csv files
        if (file.getPath.getName.endsWith(".csv")) {
            filePaths += file.getPath.toString
        }
    }

    val df: DataFrame = spark
        .read
        .options(...)  // same options as in the question
        .csv(filePaths: _*)
L. CWI
  • Thanks for answering. Yes, there are only csv files. Unfortunately `resources/**` doesn't do the trick: it only retrieves files in `resources/`, but omits files in `resources/subfolder` – NotYanka Mar 27 '17 at 11:38
  • 1
    I just realized that I'd temporarily messed up my testing scenario - you were right of course, `/**` _does_ work after all. Sorry for the confusion. :) – NotYanka Mar 28 '17 at 10:26
  • I had some ambiguity about where `Path` lives - I didn't know if it was in the default scala, hadoop, java.io, or java.nio package. Both `FileSystem` and `Path` are in the Hadoop package: you can import them with `import org.apache.hadoop.fs.{FileSystem, Path}` and the `Configuration` with `import org.apache.hadoop.conf.Configuration`. – Joyoyoyoyoyo Oct 03 '18 at 21:47
  • I tried using this in the AWS glue job, it didn't work any suggestions? – Kartik Jan 21 '23 at 20:46

You can now use the `recursiveFileLookup` option in Spark 3:

val recursiveLoadedDF = spark.read
  .option("recursiveFileLookup", "true")
  .csv("resources/")

For more details, see recursive-file-lookup in the Spark documentation.
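
If the directory also contains non-csv files, this can be combined with the `pathGlobFilter` option (also available since Spark 3.0) to restrict which files are read; a minimal sketch, where `recursiveCsvDF` is just an illustrative name:

    val recursiveCsvDF = spark.read
      .option("recursiveFileLookup", "true")
      .option("pathGlobFilter", "*.csv")  // only pick up files ending in .csv
      .csv("resources/")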

Zhixiang.W