The withColumn
and filter
execute in the order they are called. The plan explains it. Please read the plan bottom up.
val employees = spark.createDataFrame(Seq(("E1",100.0), ("E2",200.0),("E3",300.0))).toDF("employee","salary")
employees.withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).filter(col("column1")==="poor").explain(true)
Plan - project happened 1st then filter.
== Parsed Logical Plan ==
'Filter ('column1 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Filter (column1#8 = poor)
+- Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#8]
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
Code 1st filters then adds new column
employees.filter(col("employee")==="E1").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).explain(true)
Plan - 1st filters then projects
== Parsed Logical Plan ==
'Project [employee#4, salary#5, CASE WHEN ('salary > 200) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
== Analyzed Logical Plan ==
employee: string, salary: double, column1: string
Project [employee#4, salary#5, CASE WHEN (salary#5 > cast(200 as double)) THEN rich ELSE poor END AS column1#13]
+- Filter (employee#4 = E1)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]
Another evidence - it gives error when filter is called on a column before adding it (obviously)
employees.filter(col("column1")==="poor").withColumn("column1", when(col("salary") > 200, lit("rich")).otherwise("poor")).show()
org.apache.spark.sql.AnalysisException: cannot resolve '`column1`' given input columns: [employee, salary];;
'Filter ('column1 = poor)
+- Project [_1#0 AS employee#4, _2#1 AS salary#5]
+- LocalRelation [_1#0, _2#1]