I'm studying the phases of the Catalyst optimizer, but I have some doubts about how the first three phases work in practice.
In the first phase (the analysis phase), the optimizer creates a logical plan of the query. At this point the columns and tables are unresolved, so it needs to use a catalog object to resolve them.
Doubt: Do you know how this catalog object works to solve this? For example, if we are executing queries over Hive tables, does the optimizer connect to the Hive tables in HDFS to resolve the columns?
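To make the question concrete, here is a minimal sketch of how I have been inspecting the plans so far (the table name `people` is just a placeholder I made up; I assume a Hive-enabled session):

```scala
import org.apache.spark.sql.SparkSession

// Assumes a Hive-enabled session and a hypothetical Hive table "people"
val spark = SparkSession.builder()
  .appName("catalyst-analysis")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT name FROM people WHERE age > 21")

// The parsed (unresolved) logical plan, where columns are still 'unresolved
println(df.queryExecution.logical)

// The analyzed plan: attributes are now resolved against the catalog
// (for Hive tables, presumably the Hive metastore?)
println(df.queryExecution.analyzed)
```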
In the second phase (logical optimization), the optimizer applies standard rule-based optimizations to the logical plan, such as constant folding, predicate pushdown, and projection pruning.
Doubt: I'm trying to find examples to better understand what Spark really does in this phase, i.e. how constant folding, predicate pushdown, and projection pruning actually optimize the query, but I'm not finding anything concrete about this.
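For instance, I would expect a sketch like the following (against the same made-up `people` table) to show these rules firing in the optimized plan, but I'm not sure I'm reading it right:

```scala
// A query written deliberately sub-optimally
val df = spark.sql(
  """SELECT name
    |FROM people
    |WHERE age > 10 + 11""".stripMargin)

// In the optimized plan I would expect to see:
//  - constant folding:    10 + 11 rewritten to the literal 21
//  - predicate pushdown:  the Filter pushed down toward the relation scan
//  - projection pruning:  only the name and age columns kept in the scan
println(df.queryExecution.optimizedPlan)

// explain(true) prints all the plans: parsed, analyzed, optimized, physical
df.explain(true)
```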
In the third phase (physical planning), Spark takes the optimized logical plan and generates one or more physical plans, using physical operators that match the Spark execution engine.
Doubt: Can someone explain this part, "using physical operators that match the Spark execution engine"? What does that mean in practice?
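Here is how far I got inspecting this phase myself (same assumed setup as above):

```scala
// The physical plan chosen by the planner for the logical plan
println(df.queryExecution.sparkPlan)

// The final plan after preparation rules (e.g. whole-stage codegen)
println(df.queryExecution.executedPlan)

// explain() without arguments prints only the physical plan, where
// operators such as HiveTableScan/FileScan, Filter and Project appear
df.explain()
```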