I've read this question in which the OP tried to convert this logical plan:

Aggregate [sum(inc(vals#4L)) AS sum(inc(vals))#7L]
+- LocalRelation [vals#4L]

To this:

Aggregate [sum(inc_val#6L) AS sum(inc(vals))#7L]
+- Project [inc(vals#4L) AS inc_val#6L]
   +- LocalRelation [vals#4L]
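
For context, here is a minimal sketch of a query shape that produces a plan like the first one. This is an assumed reconstruction: `inc` as a Scala UDF over a Long column, and all object/column names are illustrative, not taken from the linked question.

```scala
// Hypothetical sketch of the query shape behind the plans above.
// Assumes `inc` is a Scala UDF over a Long column; names are illustrative.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{sum, udf}

object IncSumSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("inc-sum").getOrCreate()
    import spark.implicits._

    val inc = udf((x: Long) => x + 1)        // the UDF from the plan
    val df  = Seq(1L, 2L, 3L).toDF("vals")   // LocalRelation [vals]

    // sum(inc(vals)): the UDF call sits directly inside the Aggregate.
    // Pulling it out into a separate Project is what the linked question
    // attempts, i.e. roughly:
    //   df.select(inc($"vals").as("inc_val")).agg(sum($"inc_val"))
    val agg = df.agg(sum(inc($"vals")))
    agg.explain(true)                        // Aggregate over LocalRelation
    println(agg.first().getLong(0))          // (1+1) + (2+1) + (3+1) = 9
    spark.stop()
  }
}
```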

I have a couple of questions:

  1. Why did he want to have the Project operator? What is the advantage of it?
  2. As far as I know, Project is the operator that represents the SELECT statement, so how could a plan possibly not include a Project operator?
ernest_k
Alon

1 Answer


Premise: aggregation involves two steps:

  1. calculation of a partial aggregate,

  2. followed by a shuffle and a global aggregate.

In other words, Aggregate is a two-stage operation.
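
This two-stage shape is visible in the physical plan. A sketch, assuming a local SparkSession (column names are illustrative):

```scala
// Sketch: inspect the physical plan of a grouped sum to see both stages.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object TwoStageAgg {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("two-stage").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1L), ("a", 2L), ("b", 3L)).toDF("key", "vals")
    df.groupBy($"key").agg(sum($"vals")).explain()
    // A typical physical plan shows the two stages around a shuffle:
    //   HashAggregate(keys=[key], functions=[sum(vals)])               <- global
    //   +- Exchange hashpartitioning(key, ...)                         <- shuffle
    //      +- HashAggregate(keys=[key], functions=[partial_sum(vals)]) <- partial
    spark.stop()
  }
}
```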

  1. Why did he want to have the Project operator? What is the advantage of it?
    I can think of two scenarios:
    Codegen enabled: with whole-stage code generation enabled, the code for the entire stage (including the UDF) runs within a single iterator. This can cause memory pressure when the UDF is CPU- or memory-intensive, since the aggregation happens in that same iterator.
    Codegen disabled: with code generation disabled, each operator runs in its own iterator. The chance of an OOM or high CPU usage is lower because aggregation is decoupled from the UDF evaluation.
  2. As far as I know, Project is the operator that represents the SELECT statement, so how could a plan possibly not include a Project operator?
    As long as the last operator returns the required columns, no explicit Project is needed. For example, suppose the top-most operator (say, a Filter) returns 3 columns. If exactly those 3 columns are required as output, no explicit Project is needed. If only 2 of the 3 are required, an explicit Project is added. An optimizer rule does this as part of query planning: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L440
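
A sketch of this, assuming a local SparkSession (column names are illustrative); the two optimized plans can be compared via explain:

```scala
// Sketch: compare plans with and without column pruning.
import org.apache.spark.sql.SparkSession

object ProjectOrNot {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("project-or-not").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a", true), (2, "b", false)).toDF("id", "name", "flag")

    // All 3 columns are required downstream: Filter already returns them,
    // so the optimized plan needs no explicit Project.
    df.filter($"flag").explain(true)

    // Only 2 of the 3 columns are required: the optimizer keeps a
    // Project [id, name] (or pushes the pruning into the scan).
    df.filter($"flag").select("id", "name").explain(true)
    spark.stop()
  }
}
```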
DaRkMaN
  • Thank you for your answer. I understand section 2. I don't understand section 1. How does Project help to solve the problems you mentioned? – Alon Feb 04 '20 at 12:39
  • The difference is in the way the query is executed. Without codegen, Spark uses the Volcano-style iterator model for execution, in which each operator executes separately, so the chance of errors (such as OOMs) is lower. When codegen is enabled, all the operators are collapsed and executed as a single function, so the chance of errors increases (especially when one of the operators involves a UDF). – DaRkMaN Feb 05 '20 at 06:00
  • I understand the advantage of disabling Codegen. I just don't understand what it has to do with the fact that the OP wanted to have a Project operator in his plan. – Alon Feb 05 '20 at 08:54
  • Got it. I had been assuming all along that the OP would be disabling codegen and basing my answers on that :) Maybe there are other reasons to use it (like metrics for the Project operator), but from the code there aren't any. – DaRkMaN Feb 05 '20 at 10:31