15

I can read several JSON files at once using * (star):

sqlContext.jsonFile('/path/to/dir/*.json')

Is there any way to do the same thing for Parquet? The star doesn't work.

SkyFox

4 Answers

30

FYI, you can also do either of the following; a combined sketch follows the list:

  • read a subset of the Parquet files using the * wildcard:

    sqlContext.read.parquet("/path/to/dir/part_*.gz")

  • read multiple Parquet files by specifying them explicitly:

    sqlContext.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
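A minimal runnable sketch of both approaches (the paths are hypothetical; assumes Spark 1.4+ and an existing SparkContext sc):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# 1) Wildcard: read every part file matching the glob
df_glob = sqlContext.read.parquet("/path/to/dir/part_*.gz")

# 2) Explicit list: read.parquet accepts any number of paths
df_list = sqlContext.read.parquet("/path/to/dir/part_1.gz",
                                  "/path/to/dir/part_2.gz")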

Boris
  • In addition, you can also use a Hadoop glob pattern, or take advantage of Spark's partitioning scheme; see https://stackoverflow.com/a/41712465/179014. – asmaier Sep 12 '17 at 15:58
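A hedged sketch of the comment's two suggestions (the paths and a SparkSession named spark are assumptions; Hadoop glob syntax supports {a,b} alternation):

# Hadoop glob: brace alternation selects several partitions at once
df = spark.read.parquet("/path/to/dir/date={18-07-23,18-07-24}/*.parquet")

# Partitioning scheme: read the base directory and filter on the
# partition column, letting Spark prune partitions for you
df = spark.read.parquet("/path/to/dir").filter("date = '18-07-23'")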
24
InputPath = [hdfs_path + "parquets/date=18-07-23/hour=2*/*.parquet",
             hdfs_path + "parquets/date=18-07-24/hour=0*/*.parquet"]

df = spark.read.parquet(*InputPath)
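The leading * unpacks the Python list, so each path is passed to parquet(...) as a separate argument; the reader accepts any number of paths, and each one may itself contain Hadoop glob wildcards such as hour=2*.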
user6602391
11

See this issue on the Spark JIRA. It is supported from 1.4 onwards.

Without upgrading to 1.4, you could either point at the top-level directory:

sqlContext.parquetFile('/path/to/dir/')

which will load all files in the directory. Alternatively, you could use the HDFS API to find the files you want, and pass them to parquetFile (it accepts varargs).
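A minimal sketch of that HDFS-API route, assuming a live SparkContext sc; it reaches into the JVM through the private _jvm/_jsc handles, which is a common but unofficial PySpark pattern, and the glob itself is hypothetical:

hadoop = sc._jvm.org.apache.hadoop.fs
fs = hadoop.FileSystem.get(sc._jsc.hadoopConfiguration())

# Glob for the part files we want and collect their full paths
statuses = fs.globStatus(hadoop.Path('/path/to/dir/part_*.parquet'))
paths = [str(status.getPath()) for status in statuses]

# parquetFile accepts varargs, so unpack the list
df = sqlContext.parquetFile(*paths)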

dpeacock
5

To read multiple files, give the path with a * wildcard:

Example

pqtDF = sqlContext.read.parquet("Path_*.parquet")
Idrees