32

I need to read parquet files from multiple paths that are not parent or child directories.

For example:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2

Right now I'm reading each directory and merging the DataFrames using "unionAll". Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancier way of using unionAll?
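Here is roughly what that looks like today (a sketch; paths simplified to match the layout above):

# current approach: read each directory separately, then merge the two DataFrames row-wise
df1 = sqlContext.read.parquet('/dir1/dir1_2')
df2 = sqlContext.read.parquet('/dir2/dir2_1')
df = df1.unionAll(df2)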

Thanks

joshsuihn
  • Hi, I have a similar task reading multiple JSON files, but the code people provided here didn't work :( Did you find a solution? – zhifff Mar 11 '20 at 13:51

6 Answers

64

A little late but I found this while I was searching and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet()

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)

This is convenient if you want to pass a few globs into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is handy because you don't need to list every file under the basePath, and you still get partition inference.
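If you want to sanity-check the inference, the partition columns taken from the directory names (partition_value1 and partition_value2 in the globs above) should show up in the schema:

# partition columns inferred from the directory names under basePath
df.printSchema()
df.select("partition_value1", "partition_value2").distinct().show()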

N00b
  • When I just use this code, it searches for the directories in the /home/ directory. Can you please post the entire syntax? – Viv Jun 19 '17 at 17:38
  • @N00b When I tried this code, it gives me an error that load takes only 4 arguments, but I have paths to 24 files. Is there an option to override this? I am trying to avoid multiple loads and a union; that is why I would like to use load to put multiple files into one DataFrame. – E B Jan 19 '18 at 15:42
  • Works perfectly for me! @EB, did you save it as a list and then run it as an expression `(*paths)`? – thentangler Jul 14 '20 at 21:32
13

Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:

df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

or

df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
John Conley
  • None of these works for me. It finds "suspicious paths" and then gives me a long list of java stuff. – mic Mar 01 '19 at 23:15
  • Happened to me too. You need to add an option: .option("basePath", "file:///your/path/") as in this answer: https://stackoverflow.com/a/33656595/2851294 – AssafR Feb 08 '22 at 16:23
13

In case you have a list of files, you can do:

files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
Russ
5

For ORC:

spark.read.orc("/dir1/*","/dir2/*")

Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.

For Parquet:

spark.read.parquet("/dir1/*","/dir2/*")
Ganesh
4

Just taking John Conley's answer, embellishing it a bit, and providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.

from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')  # WebHDFS client, used only to walk the directory tree

import posixpath as psp
fpaths = [
  psp.join("hdfs://localhost:9000" + dpath, fname)
  for dpath, _, fnames in client.walk('/eta/myHdfsPath')
  for fname in fnames
]
# At this point fpaths contains every file under /eta/myHdfsPath

parquetFile = sqlContext.read.parquet(*fpaths)


import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf
VenVig
0

In Spark (Scala) you can do this:

val df = spark.read
  .option("header", "true")
  .option("basePath", "s3://bucket/")
  .csv("s3://bucket/{sub-dir1,sub-dir2}/")
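The same curly-brace glob should also work with the Parquet reader in PySpark; a sketch for the directories in the question (paths are illustrative):

# {a,b} expands to both sub-directories, so only dir1_2 and dir2_1 are read
df = spark.read.parquet("/{dir1/dir1_2,dir2/dir2_1}")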