
I have a directory with some subfolders, each containing different parquet files. Something like this:

2017-09-05
    10-00
        part00000.parquet
        part00001.parquet
    11-00
        part00000.parquet
        part00001.parquet
    12-00
        part00000.parquet
        part00001.parquet

What I want is, by passing the path to the directory 2017-09-05, to get a list of the names of all the parquet files.

I was able to achieve it, but in a very inefficient way:

 val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
 allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))

So each key is the name I am looking for, but this approach also loads the content of every file, which I then cannot use, since I get it in binary (and I don't know how to convert it into a DataFrame).

Once I have the keys (that is, the list of file paths), I am planning to invoke:

val myParquetDF = sqlContext.read.parquet(filePath);
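
For what it is worth, a minimal sketch of that plan, assuming the allParquetFiles RDD from above: collect the keys on the driver and hand them all to read.parquet, which accepts several paths at once (this still pays the cost of wholeTextFiles reading every file, which the answers below avoid):

    // Collect only the keys (paths); the binary contents are not needed.
    val paths: Array[String] = allParquetFiles.keys.collect()
    // DataFrameReader.parquet accepts multiple paths, so one call reads them all.
    val myParquetDF = sqlContext.read.parquet(paths: _*)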

As you may have already understood, I am quite new to Spark. So if there is a faster or easier approach to reading a list of parquet files located in different folders, please let me know.

My partial solution: I wasn't able to get the paths of all the file names in a folder, but I was able to get the content of all files of that type into the same DataFrame, which was my ultimate goal. In case someone may need it in the future, I used the following line:

val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
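
If the individual file names are still needed alongside the data, one hedged option (assuming a Spark version that ships org.apache.spark.sql.functions.input_file_name, i.e. 1.6+) is to tag every row with its source path:

    import org.apache.spark.sql.functions.input_file_name
    // Add a column holding the path of the parquet file each row was read from,
    // then reduce it to the distinct list of file names.
    val withSource = df.withColumn("source_file", input_file_name())
    val fileNames = withSource.select("source_file").distinct().collect()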

Thanks for your time

Ignacio Alorre
  • You can create a map of `filename -> DataFrame` using `sc.wholeTextFiles("path/*/*/").map(x => x._1 -> sqlContext.read.parquet(x._1))`, but it looks extremely weird, especially when there are hundreds of files. – philantrovert Sep 05 '17 at 12:21

2 Answers


You can do it using the HDFS API, like this:

import org.apache.hadoop.fs._
import org.apache.hadoop.conf._

// Use globStatus so the wildcard pattern is expanded; listStatus does not accept globs.
val fs = FileSystem.get(new Configuration())
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
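
The result is a plain Array[String] of paths, i.e. exactly the list of names asked for, and nothing has been loaded yet; for example:

    // Print the discovered parquet paths without reading any file contents.
    files.foreach(println)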
jeanr

First, it is better to avoid wholeTextFiles: that method reads each file in full as a single record. Try the textFile method instead.

Second, if you need to pick up all files recursively under one directory, you can achieve it with the textFile method:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

This configuration enables recursive search (it works for Spark jobs as well as for MapReduce jobs). Then just invoke sc.textFile(path).
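
Putting those two pieces together, a minimal sketch (assuming an existing SparkContext named sc; note that textFile exposes the parquet bytes as text, so for the actual data you would still go through sqlContext.read.parquet):

    // Enable recursive traversal of input directories, then read from the top-level folder.
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val rdd = sc.textFile("C:/MyDocs/2017-09-05")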

Natalia