I'm studying the phases of the Catalyst optimizer, but I have some doubts about how the first three phases work in practice.
In the first phase (the analysis phase), the optimizer creates a logical plan of the query. At this point the columns and tables are unresolved, so it needs to use a catalog object to resolve them.
Doubt: Do you know how this catalog object works to solve this? For example, if we are executing queries over Hive tables, does the optimizer connect to the Hive tables in HDFS to resolve the columns?
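To make the question concrete, here is a minimal sketch of how I have been inspecting the plans so far (the table name `people` is just a placeholder I made up; I assume a Hive-enabled session):

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled session and a hypothetical Hive table "people"
val spark = SparkSession.builder()
  .appName("catalyst-analysis")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT name FROM people WHERE age > 21")

// The parsed (unresolved) logical plan, where columns are still 'unresolved
println(df.queryExecution.logical)

// The analyzed plan: attributes are now resolved against the catalog
// (for Hive tables, presumably the Hive metastore?)
println(df.queryExecution.analyzed)
```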
In the second phase (logical optimization), the optimizer applies standard rule-based optimizations to the logical plan, such as constant folding, predicate pushdown, and projection pruning.
Doubt: I'm trying to find examples to better understand what Spark really does in this phase, i.e. how constant folding, predicate pushdown, and projection pruning actually optimize the query, but I'm not finding anything concrete about this.
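For instance, I would expect a sketch like the following (against the same made-up `people` table) to show these rules firing in the optimized plan, but I'm not sure I'm reading it right:

```scala
// A query written deliberately sub-optimally
val df = spark.sql(
  """SELECT name
    |FROM people
    |WHERE age > 10 + 11""".stripMargin)

// In the optimized plan I would expect to see:
//  - constant folding:    10 + 11 rewritten to the literal 21
//  - predicate pushdown:  the Filter pushed down toward the relation scan
//  - projection pruning:  only the name and age columns kept in the scan
println(df.queryExecution.optimizedPlan)

// explain(true) prints all the plans: parsed, analyzed, optimized, physical
df.explain(true)
```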
In the third phase (physical planning), Spark takes the optimized logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine.
Doubt: Can someone explain this part, "using physical operators that match the Spark execution engine"? What does that mean in practice?
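Here is how far I got inspecting this phase myself (same assumed setup as above):

```scala
// The physical plan chosen by the planner for the logical plan
println(df.queryExecution.sparkPlan)

// The final plan after preparation rules (e.g. whole-stage codegen)
println(df.queryExecution.executedPlan)

// explain() without arguments prints only the physical plan, where
// operators such as HiveTableScan/FileScan, Filter and Project appear
df.explain()
```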