3

I am hoping to generate an explain/execution plan in Spark 2.2 for a query that ends in an action on a DataFrame. The goal is to confirm that partition pruning is occurring as expected before I kick off the job and consume cluster resources. I searched the Spark documentation and SO but couldn't find a syntax that worked for my situation.

Here is a simple example that works as expected:

scala> List(1, 2, 3, 4).toDF.explain
== Physical Plan ==
LocalTableScan [value#42]

Here's an example that does not work, but that I am hoping to get working:

scala> List(1, 2, 3, 4).toDF.count.explain
<console>:24: error: value explain is not a member of Long
List(1, 2, 3, 4).toDF.count.explain
                               ^

And here's a more detailed example to illustrate the end goal: confirming, via the explain plan, that partition pruning happens for a query like this:

val newDf = spark.read.parquet(path).filter(s"start >= ${startDt}").filter(s"start <= ${endDt}")
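For the pruning check itself, no action is needed at all: calling `explain` on the (lazy) filtered DataFrame already prints the physical plan. A sketch, assuming a hypothetical path `/data/events` and a partition column named `start`:

```scala
// Sketch (hypothetical path and partition column): inspect the plan
// of the lazy DataFrame before running any action on it.
val startDt = "2018-01-01"
val endDt   = "2018-01-31"

val newDf = spark.read.parquet("/data/events")       // partitioned by `start`
  .filter(s"start >= '${startDt}'")
  .filter(s"start <= '${endDt}'")

newDf.explain
// If pruning kicks in, the FileScan node of the physical plan lists the
// predicates under PartitionFilters rather than as a post-scan Filter.
```

Note the quoting in the filter strings: if `start` is a string/date partition column, the literals need to be quoted inside the SQL expression or the comparison will not parse as intended.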

Thanks in advance for any thoughts/feedback.

user9074332
    `explain` is a Spark SQL method offered as part of the Dataset/DataFrame API. When you do something like `count` or `collect`, it returns a `Long` or an `Array` respectively, and those types don't have `explain` as a member. – Sivaprasanna Sethuraman May 29 '18 at 05:27

1 Answer

3

The `count` method is eagerly evaluated and, as you've seen, returns a `Long`, so there is no execution plan available on its result.

You have to use a lazy transformation, either:

import org.apache.spark.sql.functions.count

df.select(count($"*")).explain

or

df.groupBy().agg(count($"*")).explain
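If you'd rather not rewrite the query, the plans are also reachable through the Dataset's `queryExecution` field, which exposes the logical and physical plans of any lazy DataFrame. A sketch, where `df` stands for any DataFrame you have built:

```scala
// Sketch: inspect the plans directly instead of calling explain.
// queryExecution is available on every Dataset/DataFrame.
import org.apache.spark.sql.functions.count

val agg = df.groupBy().agg(count($"*"))
println(agg.queryExecution.optimizedPlan)  // optimized logical plan
println(agg.queryExecution.executedPlan)   // physical plan, what explain prints
```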
Alper t. Turker