
I want to run SQL on my Parquet data in Spark using the following code:

val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")
val df = spark.sql("select column_1, column_4, column_10 from table_name")
println(df.count())

My question is: does this code read only the required columns from the disk?

Theoretically the answer should be yes, but I need an expert opinion, because in the case of JDBC queries (MySQL) the read phase (`spark.read`) takes more time compared to the actions (maybe it is related to the connection, but I am not sure). The JDBC code follows:

spark.read.jdbc(jdbcUrl, query, props).createOrReplaceTempView(table_name)
val df = spark.sql("select column_1, column_4, column_10 from table_name")
df.show()
println(df.count())

If someone can explain the framework flow in both the cases, it will be very helpful.

Spark version 2.3.0

Scala version 2.11.11

Devas

1 Answer


In both cases Spark will do its best to limit traffic to only the required data (the exact behavior depends on the format and version; depending on the context some optimizations might not be applied, typically with deeply nested data). In fact, the `spark.sql("select ...")` part is not even relevant here: since the only action is `count`, the actual query should be reduced to something equivalent to `SELECT 1 FROM table` for a given format.
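One way to check this yourself (a sketch, reusing the path and column names from the question; the exact plan output varies by Spark version) is to inspect the physical plan. For Parquet, the `FileScan` node's `ReadSchema` lists only the columns the query actually needs:

```scala
// Sketch: verify column pruning on the Parquet source.
// Requires an active SparkSession; `path` is from the question.
val parquetDF = spark.read.parquet(path)
parquetDF.createOrReplaceTempView("table_name")

val df = spark.sql("select column_1, column_4, column_10 from table_name")

// The printed physical plan contains a FileScan node whose ReadSchema
// includes only the three selected columns, confirming pruning at read time.
df.explain()
```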

This stays true as long as you don't use `cache` / `persist`. If you do, all such optimizations go away and Spark will load the data eagerly (see my answers to *Any performance issues forcing eager evaluation using count in spark?* and *Caching dataframes while keeping partitions*; there is also an example of how the execution plan changes when `cache` is used).
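The caching effect described above can be seen by comparing plans (again a sketch reusing the question's view; plan details vary by version):

```scala
// Before caching: the plan shows a FileScan with a pruned ReadSchema.
spark.sql("select column_1 from table_name").explain()

// Cache the whole view and materialize it with an action.
val full = spark.table("table_name").cache()
full.count()

// After caching: the plan reads from InMemoryTableScan over the
// materialized data, so pruning at the source no longer applies.
spark.sql("select column_1 from table_name").explain()
```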

Alper t. Turker
  • So you mean in the first case only the required columns would be fetched from the disk? If so, is there any difference in the second case? It should load data from the DB only after the call `df.count()`. Am I correct? – Devas May 18 '18 at 05:09
  • 1
    `count` doesn't require the data so it is not a good example, but otherwise yes. There is no difference between JDBC or Parquet source. – Alper t. Turker May 18 '18 at 09:07
  • Yes, `count` may not be a good example. Still, I'm not sure why the read phase (`spark.read`) is taking more time. – Devas May 21 '18 at 06:42
  • Query optimization is only a small fraction of the overall performance. There are many more factors here (for starters, the default JDBC behavior is sequential: everything is read through a single partition). – Alper t. Turker May 21 '18 at 09:09
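The sequential default mentioned in the last comment can be avoided by partitioning the JDBC read. A sketch, assuming a numeric `id` column exists in the table; the bounds and partition count are illustrative, not tuned values:

```scala
// Sketch: parallel JDBC read using DataFrameReader.jdbc with a
// numeric partition column. `jdbcUrl` and `props` are from the question;
// "table_name", bounds, and numPartitions are illustrative assumptions.
val df = spark.read.jdbc(
  jdbcUrl,
  "table_name",
  columnName = "id",      // numeric column Spark splits the range on
  lowerBound = 1L,        // min value of the split column
  upperBound = 1000000L,  // max value of the split column
  numPartitions = 8,      // 8 parallel connections instead of one
  props
)
```

Each partition issues its own `WHERE id >= ... AND id < ...` query, so the read is spread across connections instead of pulling everything through one.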