
I'm reading this question and trying to reproduce the problem.

However, I get a completely different optimized plan than the OP's.

OP's code:

spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()

OP's optimized plan:

== Optimized Logical Plan ==
Aggregate [sum(inc(vals#4L)) AS sum(inc(vals))#7L]
+- LocalRelation [vals#4L]

My code:

val df1 = spark.read
  .option("sep", "\t")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("D:/playground/pylearner.tsv")
  .withColumn("Month_year", trim($"Month_year"))

df1.createOrReplaceTempView("pylearner")

df1.show()

spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(ID)) from pylearner")
df.explain(true)
df.show()

My optimized plan:

== Optimized Logical Plan ==
Aggregate [sum(if (isnull(cast(ID#10 as bigint))) null else UDF:inc(cast(ID#10 as bigint))) AS sum(UDF:inc(cast(ID as bigint)))#45L]
+- Project [ID#10]
   +- Relation[ID#10,TYPE_ID#11,Month_year#12,Amount#13] csv
Alon

1 Answer

OP is converting a local Seq[Long] to a DataFrame, whereas you are reading a CSV file from disk. Hence the OP gets a LocalRelation, while your plan contains extra operators for scanning the CSV, projecting the ID column out of it, and handling possibly missing (null) values: the CSV-inferred ID column is nullable, so Spark wraps the UDF call in a null check, which is why you see `if (isnull(...)) null else UDF:inc(...)` in your plan.
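To reproduce the OP's plan, you can build the DataFrame from a local Seq instead of a file. A minimal sketch (the `Seq(1L, 2L, 3L)` sample data and the SparkSession setup are assumptions, since the question doesn't show how `data` was created):

```scala
import org.apache.spark.sql.SparkSession

object LocalRelationDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("local-relation-demo")
    .getOrCreate()
  import spark.implicits._

  // A local, in-memory Seq[Long]: no file scan, no schema inference,
  // and the column is non-nullable, so no null-handling wrapper is needed.
  val data = Seq(1L, 2L, 3L).toDF("vals")
  data.createOrReplaceTempView("data")

  spark.udf.register("inc", (x: Long) => x + 1)
  val df = spark.sql("select sum(inc(vals)) from data")

  // The optimized plan should show `LocalRelation [vals#...]`
  // rather than a csv Relation plus a Project.
  df.explain(true)
  df.show()

  spark.stop()
}
```

With this setup, the optimized logical plan collapses to an Aggregate over a LocalRelation, matching the OP's output.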

Jasper-M