Read a few parquet files at the same time in Spark

Question

I can read several JSON files at the same time using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for parquet? The star doesn't work.

score 30 · Answer 1 · answered May 18 '16 at 08:59

FYI, you can also:

Read a subset of the parquet files using the wildcard symbol *:

sqlContext.read.parquet("/path/to/dir/part_*.gz")

Read multiple parquet files by specifying them explicitly:

sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
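Both forms above can be combined behind one small helper. This is only a minimal sketch: `read_parquet_paths` is a hypothetical name, and `reader` is assumed to be the object returned by `sqlContext.read` (or `spark.read`).

```python
def read_parquet_paths(reader, paths):
    """Load one DataFrame from several Parquet paths or glob patterns.

    `reader` is assumed to be a DataFrameReader, e.g. sqlContext.read.
    `paths` may mix explicit file paths and wildcard patterns.
    """
    paths = list(paths)
    if not paths:
        raise ValueError("at least one path is required")
    # DataFrameReader.parquet accepts several paths as varargs,
    # so the list is unpacked with *.
    return reader.parquet(*paths)
```

It would be called as `read_parquet_paths(sqlContext.read, ["/path/to/dir/part_*.gz"])`.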

answered May 18 '16 at 08:59

Boris

In addition, you can also use a Hadoop glob pattern or take advantage of Spark's partitioning scheme; see https://stackoverflow.com/a/41712465/179014 . – asmaier Sep 12 '17 at 15:58

score 24 · Answer 2 · edited Jul 24 '18 at 03:17

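# hdfs_path is assumed to hold the HDFS base URI, ending with "/"
input_paths = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
               hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

# read.parquet takes varargs, so unpack the list with *
df = spark.read.parquet(*input_paths)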

edited Jul 24 '18 at 03:17

4b0

answered Jul 24 '18 at 03:09

user6602391

In my case, I first filter the files in S3 and then pass the list to read.parquet(). Thanks! – Carlos Gomez Dec 06 '19 at 13:53
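The filter-then-read approach from this comment might look roughly like the sketch below. Everything named here is an assumption for illustration: `select_parquet_paths` is a hypothetical helper, the key listing stands in for whatever an S3 client returns, and the bucket name and `s3a://` scheme depend on the cluster setup.

```python
from fnmatch import fnmatchcase


def select_parquet_paths(keys, pattern, bucket):
    """Return s3a:// URIs for the object keys that match a glob pattern.

    `keys` is assumed to be a listing obtained elsewhere (for example
    from an S3 client); `bucket` and `pattern` are illustrative.
    """
    return ["s3a://{}/{}".format(bucket, key)
            for key in sorted(keys)
            if fnmatchcase(key, pattern)]


# Hypothetical listing; in practice this would come from S3.
keys = [
    "parquets/date=18-07-23/part-0.parquet",
    "parquets/date=18-07-23/_SUCCESS",
    "parquets/date=18-07-24/part-0.parquet",
]
paths = select_parquet_paths(
    keys, "parquets/date=18-07-2[34]/*.parquet", "my-bucket")
# df = spark.read.parquet(*paths)  # needs an active SparkSession
```

Filtering in plain Python keeps marker files such as `_SUCCESS` out of the list before Spark sees it.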

score 11 · Accepted Answer · answered May 24 '15 at 16:18

See this issue on the Spark JIRA. Using a wildcard for parquet paths is supported from Spark 1.4 onwards.

Without upgrading to 1.4, you could either point at the top level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass them to parquetFile (it accepts varargs).

answered May 24 '15 at 16:18

dpeacock

I get `AttributeError: 'SQLContext' object has no attribute 'parquetFile'` – Soerendip Oct 11 '18 at 17:51
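That error is consistent with `parquetFile` having been deprecated in Spark 1.4 in favour of `read.parquet` and, as far as I know, removed in Spark 2.0. A minimal sketch of a helper that works with either API follows; `load_parquet` is a hypothetical name, not a Spark function.

```python
def load_parquet(ctx, *paths):
    """Read Parquet paths with whichever API the given context offers.

    `ctx` is assumed to be a SQLContext or a SparkSession.
    """
    reader = getattr(ctx, "read", None)
    if reader is not None:
        # Spark 1.4 and later: the DataFrameReader API.
        return reader.parquet(*paths)
    # Older releases: SQLContext.parquetFile, which also takes varargs.
    return ctx.parquetFile(*paths)
```

On a current Spark version, `load_parquet(sqlContext, '/path/to/dir/')` goes through `sqlContext.read.parquet`.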

score 5 · Answer 4 · edited Jan 15 '19 at 11:03

To read, give the file path with a '*' wildcard.

Example:

pqtDF = sqlContext.read.parquet("Path_*.parquet")

edited Jan 15 '19 at 11:03

Suraj Rao

answered Jan 15 '19 at 10:57

Idrees
