32

I need to read parquet files from multiple paths that are not parent or child directories.

For example:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2

Right now I'm reading each directory and merging the DataFrames using "unionAll". Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there some fancier way of using unionAll?
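Here is roughly what that looks like today (a sketch; paths simplified to match the layout above):

# current approach: read each directory separately, then merge the two DataFrames row-wise
df1 = sqlContext.read.parquet('/dir1/dir1_2')
df2 = sqlContext.read.parquet('/dir2/dir2_1')
df = df1.unionAll(df2)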

Thanks

joshsuihn
  • Hi, I have a similar task reading multiple JSON files, but the code people provided here didn't work :( Did you find a solution? – zhifff Mar 11 '20 at 13:51

6 Answers

64

A little late but I found this while I was searching and it may help someone else...

You might also try unpacking the argument list to spark.read.parquet()

paths = ['foo', 'bar']
df = spark.read.parquet(*paths)

This is convenient if you want to pass a few globs into the path argument:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
df = spark.read.option("basePath", basePath).parquet(*paths)

This is handy because you don't need to list every file under the basePath, and you still get partition inference.
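If you want to sanity-check the inference, the partition columns taken from the directory names (partition_value1 and partition_value2 in the globs above) should show up in the schema:

# partition columns inferred from the directory names under basePath
df.printSchema()
df.select("partition_value1", "partition_value2").distinct().show()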

N00b
  • When I just use this code, it searches for the directories in the /home/ directory. Can you please post the entire syntax? – Viv Jun 19 '17 at 17:38
  • @N00b When I tried this code, it gives me an error that load takes only 4 arguments, but I have paths to 24 files. Is there an option to override this? I am trying to avoid multiple loads and a union; that is why I would like to use load to put multiple files into one DataFrame. – E B Jan 19 '18 at 15:42
  • Works perfectly for me! @EB, did you save it as a list and then run it as an expression `(*paths)`? – thentangler Jul 14 '20 at 21:32
13

Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:

df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

or

df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
John Conley
  • None of these works for me. It finds "suspicious paths" and then gives me a long list of java stuff. – mic Mar 01 '19 at 23:15
  • Happened to me too. You need to add an option: .option("basePath", "file:///your/path/") as in this answer: https://stackoverflow.com/a/33656595/2851294 – AssafR Feb 08 '22 at 16:23
13

In case you have a list of files, you can do:

files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
Russ
5

For ORC:

spark.read.orc("/dir1/*","/dir2/*")

Spark goes inside the dir1/ and dir2/ folders and loads all the ORC files.

For Parquet:

spark.read.parquet("/dir1/*","/dir2/*")
Ganesh
4

Just taking John Conley's answer, embellishing it a bit, and providing the full code (used in Jupyter PySpark), as I found his answer extremely useful.

from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')  # WebHDFS client, used only to walk the directory tree

import posixpath as psp
fpaths = [
  psp.join("hdfs://localhost:9000" + dpath, fname)
  for dpath, _, fnames in client.walk('/eta/myHdfsPath')
  for fname in fnames
]
# At this point fpaths contains every file under /eta/myHdfsPath

parquetFile = sqlContext.read.parquet(*fpaths)


import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf
VenVig
0

In Spark (Scala) you can do this:

val df = spark.read
  .option("header", "true")
  .option("basePath", "s3://bucket/")
  .csv("s3://bucket/{sub-dir1,sub-dir2}/")
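The same curly-brace glob should also work with the Parquet reader in PySpark; a sketch for the directories in the question (paths are illustrative):

# {a,b} expands to both sub-directories, so only dir1_2 and dir2_1 are read
df = spark.read.parquet("/{dir1/dir1_2,dir2/dir2_1}")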