
I'm reading this question and trying to reproduce the problem.

However, I get a completely different optimized plan than the OP's.

OP's code:

spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(vals)) from data")
df.explain(true)
df.show()

OP's optimized plan:

== Optimized Logical Plan ==
Aggregate [sum(inc(vals#4L)) AS sum(inc(vals))#7L]
+- LocalRelation [vals#4L]

My code:

val df1 = spark.read
  .option("sep", "\t")
  .option("inferSchema", "true")
  .option("header", "true")
  .csv("D:/playground/pylearner.tsv")
  .withColumn("Month_year", trim($"Month_year"))

df1.createOrReplaceTempView("pylearner")

df1.show()

spark.udf.register("inc", (x: Long) => x + 1)
val df = spark.sql("select sum(inc(ID)) from pylearner")
df.explain(true)
df.show()

My optimized plan:

== Optimized Logical Plan ==
Aggregate [sum(if (isnull(cast(ID#10 as bigint))) null else UDF:inc(cast(ID#10 as bigint))) AS sum(UDF:inc(cast(ID as bigint)))#45L]
+- Project [ID#10]
   +- Relation[ID#10,TYPE_ID#11,Month_year#12,Amount#13] csv
Alon

1 Answer

OP is converting a local Seq[Long] to a DataFrame, whereas you are reading a CSV file from disk. Hence the OP gets a LocalRelation, while your plan contains extra operators for scanning the CSV, projecting the ID column out of it, and handling possibly missing (null) values: the CSV-inferred ID column is nullable, so Spark wraps the UDF call in a null check, which is why you see `if (isnull(...)) null else UDF:inc(...)` in your plan.
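To reproduce the OP's plan, you can build the DataFrame from a local Seq instead of a file. A minimal sketch (the `Seq(1L, 2L, 3L)` sample data and the SparkSession setup are assumptions, since the question doesn't show how `data` was created):

```scala
import org.apache.spark.sql.SparkSession

object LocalRelationDemo extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("local-relation-demo")
    .getOrCreate()
  import spark.implicits._

  // A local, in-memory Seq[Long]: no file scan, no schema inference,
  // and the column is non-nullable, so no null-handling wrapper is needed.
  val data = Seq(1L, 2L, 3L).toDF("vals")
  data.createOrReplaceTempView("data")

  spark.udf.register("inc", (x: Long) => x + 1)
  val df = spark.sql("select sum(inc(vals)) from data")

  // The optimized plan should show `LocalRelation [vals#...]`
  // rather than a csv Relation plus a Project.
  df.explain(true)
  df.show()

  spark.stop()
}
```

With this setup, the optimized logical plan collapses to an Aggregate over a LocalRelation, matching the OP's output.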

Jasper-M