Efficient way to read specific columns from parquet file in spark

Question

What is the most efficient way to read only a subset of columns in Spark from a Parquet file that has many columns? Is spark.read.format("parquet").load(<parquet>).select(...col1, col2) the best way to do that? I would also prefer to use a type-safe Dataset with case classes to predefine my schema, but I am not sure whether that works well here.

Oli · Accepted Answer · 2022-04-14T20:46:35.830

score 32

val df = spark.read.parquet("fs://path/file.parquet").select(...)

This reads only the selected columns. Parquet is a columnar storage format, and it is meant for exactly this type of use case. Run df.explain and Spark will print the execution plan, which shows that only the selected columns are read. If you also use a where condition, explain shows which filters are pushed down to the physical plan. Finally, use the following code to convert the DataFrame (a Dataset of rows) to a Dataset of your case class.

import spark.implicits._  // provides the Encoder that .as[MyData] needs
case class MyData...
val ds = df.as[MyData]
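
As a minimal sketch of the steps in this answer, assuming a local SparkSession and a throwaway Parquet file with hypothetical columns `col1`, `col2`, and `col3` (these names, the sample rows, and the object name are illustrative and not from the question): the code writes a small file, reads back two columns, prints the plan, and converts the result to a typed Dataset. In the printed physical plan, the `ReadSchema` entry of the Parquet scan should list only the selected columns.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Dataset, SparkSession}

object ColumnPruningSketch {
  // Hypothetical schema; define the case class to match your own columns.
  case class MyData(col1: Int, col2: String)

  // Reads only col1 and col2 from the Parquet data at `path`.
  def readSubset(spark: SparkSession, path: String): Dataset[MyData] = {
    import spark.implicits._ // brings the Encoder for MyData into scope
    val df = spark.read.parquet(path).select("col1", "col2")
    df.explain() // the Parquet scan's ReadSchema should list only col1 and col2
    df.as[MyData]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local mode, for illustration only
      .appName("column-pruning-sketch")
      .getOrCreate()
    import spark.implicits._

    // Write a small three-column file to a temporary directory (illustrative data).
    val path = Files.createTempDirectory("parquet-sketch").resolve("data.parquet").toString
    Seq((1, "a", 1.0), (2, "b", 2.0)).toDF("col1", "col2", "col3").write.parquet(path)

    readSubset(spark, path).show()
    spark.stop()
  }
}
```

The case class is declared at object level rather than inside a method, since Spark's implicit encoders generally cannot be derived for case classes defined in a method body.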

edited Apr 14 '22 at 20:46

answered Jan 24 '18 at 12:28

what is '...' ? – Joe Oct 03 '19 at 15:25
It means that it's up to you to define the case class that fits your data. For instance `case class MyData(col1: Int, col2: String)` – Oli Oct 03 '19 at 22:56
select(col1,col2,...) where col1 and col2 are strings representing the column names – Cr4zyTun4 Jan 17 '22 at 16:46

score 7 · Answer 2 · edited Jan 30 '22 at 10:03

In at least some cases, loading a DataFrame with all columns and then selecting a subset won't work. For example, the following fails if the Parquet file contains at least one field whose type is not supported by Spark:

spark.read.format("parquet").load("<path_to_file>").select("col1", "col2")

One solution is to pass load a schema that contains only the requested columns:

spark.read.format("parquet").load("<path_to_file>",
                                   schema="col1 bigint, col2 float")

With this you can load a subset of Spark-supported Parquet columns even when loading the full file is not possible. I'm using PySpark here, but I would expect the Scala version to have something similar.
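
For reference, a rough Scala counterpart, assuming Spark 2.3 or later, where `DataFrameReader.schema` accepts a DDL-formatted string. The object and method names are hypothetical, and the column names and types are copied from the PySpark call. A user-supplied schema is expected to skip schema inference in the same way, though the unsupported-type case is not verified here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object SchemaSubsetSketch {
  // Reads only the columns named in the DDL string from the Parquet data at `path`.
  // The declared types must match what is stored in the file.
  def readSubset(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema("col1 bigint, col2 float") // names and types taken from the PySpark example
      .parquet(path)
}
```

Calling `SchemaSubsetSketch.readSubset(spark, "<path_to_file>").printSchema()` should then show just `col1` and `col2`.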

Having the `"col1", "col2"` filled in instead of `...` made this a slightly more useful/practical answer than the currently accepted answer for me. — DaReal, Mar 01 '22 at 21:20

score 6 · Answer 3 · answered Jan 24 '18 at 12:21


Spark supports pushdowns with Parquet so

load(<parquet>).select(...col1, col2)

is fine.

Regarding "I would also prefer to use typesafe dataset with case classes to pre-define my schema but not sure":

This could be an issue, as it looks like some optimizations don't work in this context; see "Spark 2.0 Dataset vs DataFrame".
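
One commonly cited case is filtering: a typed lambda is opaque to the optimizer, while a column expression is not. The sketch below is only an illustration under that assumption. The case class, object, and method names are hypothetical, and the answer does not say which optimizations it has in mind. It prints both physical plans so the `PushedFilters` entries of the Parquet scans can be compared.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

object TypedFilterSketch {
  // Hypothetical schema for illustration.
  case class MyData(col1: Int, col2: String)

  // Typed filter: the predicate is an arbitrary Scala function on MyData.
  def typedFilter(ds: Dataset[MyData]): Dataset[MyData] =
    ds.filter((d: MyData) => d.col1 > 1)

  // Column-expression filter: the predicate is visible to the optimizer.
  def columnFilter(ds: Dataset[MyData]): Dataset[MyData] =
    ds.filter(ds("col1") > 1)

  // Prints both physical plans for the Parquet data at `path`.
  def comparePlans(spark: SparkSession, path: String): Unit = {
    import spark.implicits._ // Encoder for MyData
    val ds = spark.read.parquet(path).as[MyData]
    typedFilter(ds).explain()
    columnFilter(ds).explain()
  }
}
```

Both variants return the same rows; the difference, if any, shows up only in the plans.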

Alper t. Turker

what is '...' ? – Joe Oct 03 '19 at 15:26

moriarty007 · Answer 4 · 2018-07-31T13:46:51.797

score 2

Parquet is a columnar file format. It is designed for exactly this kind of use case.

val df = spark.read.parquet("<PATH_TO_FILE>").select(...)

should do the job for you.

edited Jul 31 '18 at 13:46

answered Mar 20 '18 at 19:53


what is '...' ? – Joe Oct 03 '19 at 15:25
'...' is the column names to load. – Greg Nov 23 '22 at 14:30
