
I have a directory with some subfolders, each containing different parquet files. Something like this:

2017-09-05
    10-00
        part00000.parquet
        part00001.parquet
    11-00
        part00000.parquet
        part00001.parquet
    12-00
        part00000.parquet
        part00001.parquet

What I want is, by passing the path to the directory 2017-09-05, to get a list of the names of all the parquet files.

I was able to achieve it, but in a very inefficient way:

 val allParquetFiles = sc.wholeTextFiles("C:/MyDocs/2017-09-05/*/*.parquet")
 allParquetFiles.keys.foreach((k) => println("The path to the file is: "+k))

So each key is the name I am looking for, but this approach also loads the content of every file, which I then cannot use, since I get it in binary (and I don't know how to convert it into a DataFrame).

Once I have the keys (that is, the list of file paths), I am planning to invoke:

val myParquetDF = sqlContext.read.parquet(filePath);
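
For what it is worth, a minimal sketch of that plan, assuming the allParquetFiles RDD from above: collect the keys on the driver and hand them all to read.parquet, which accepts several paths at once (this still pays the cost of wholeTextFiles reading every file, which the answers below avoid):

    // Collect only the keys (paths); the binary contents are not needed.
    val paths: Array[String] = allParquetFiles.keys.collect()
    // DataFrameReader.parquet accepts multiple paths, so one call reads them all.
    val myParquetDF = sqlContext.read.parquet(paths: _*)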

As you may have already understood, I am quite new to Spark. So if there is a faster or easier approach to reading a list of parquet files located in different folders, please let me know.

My partial solution: I wasn't able to get the paths of all the file names in a folder, but I was able to get the content of all files of that type into the same DataFrame, which was my ultimate goal. In case someone may need it in the future, I used the following line:

val df = sqlContext.read.parquet("C:/MyDocs/2017-09-05/*/*.parquet")
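
If the individual file names are still needed alongside the data, one hedged option (assuming a Spark version that ships org.apache.spark.sql.functions.input_file_name, i.e. 1.6+) is to tag every row with its source path:

    import org.apache.spark.sql.functions.input_file_name
    // Add a column holding the path of the parquet file each row was read from,
    // then reduce it to the distinct list of file names.
    val withSource = df.withColumn("source_file", input_file_name())
    val fileNames = withSource.select("source_file").distinct().collect()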

Thanks for your time

Ignacio Alorre
  • You can create a map of `filename -> DataFrame` using `sc.wholeTextFiles("path/*/*/").map(x => x._1 -> sqlContext.read.parquet(x._1))`, but it looks extremely weird, especially when there are hundreds of files. – philantrovert Sep 05 '17 at 12:21

2 Answers


You can do it using the HDFS API, like this:

import org.apache.hadoop.fs._
import org.apache.hadoop.conf._

// Use globStatus so the wildcard pattern is expanded; listStatus does not accept globs.
val fs = FileSystem.get(new Configuration())
val files = fs.globStatus(new Path("C:/MyDocs/2017-09-05/*/*.parquet")).map(_.getPath.toString)
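
The result is a plain Array[String] of paths, i.e. exactly the list of names asked for, and nothing has been loaded yet; for example:

    // Print the discovered parquet paths without reading any file contents.
    files.foreach(println)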
jeanr

First, it is better to avoid wholeTextFiles: that method reads each file in full as a single record. Try the textFile method instead.

Second, if you need to pick up all files recursively under one directory, you can achieve it with the textFile method:

sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

This configuration enables recursive search (it works for Spark jobs as well as for MapReduce jobs). Then just invoke sc.textFile(path).
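
Putting those two pieces together, a minimal sketch (assuming an existing SparkContext named sc; note that textFile exposes the parquet bytes as text, so for the actual data you would still go through sqlContext.read.parquet):

    // Enable recursive traversal of input directories, then read from the top-level folder.
    sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
    val rdd = sc.textFile("C:/MyDocs/2017-09-05")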

Natalia