The Catalyst optimizer makes use of standard features of the Scala language, such as pattern matching. At its core, Catalyst contains a general tree representation and a set of rules to manipulate those trees, together with libraries specific to relational query processing. Different rule sets handle the different phases of query execution: analysis, logical optimization, physical planning, and code generation that compiles parts of queries to Java bytecode.
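To make the tree-plus-rules idea concrete, here is a minimal sketch of a custom rule added through the public experimental hook (the rule and its name are hypothetical, spark is a spark-shell session, and the three-argument Add pattern assumes Spark 3.x):

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// Hypothetical rule: rewrite `expr + 0` to `expr` by pattern matching on the expression tree.
object RemoveAddZero extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(left, Literal(0, IntegerType), _) => left
  }
}

// Extra rules run as an additional optimizer batch after the built-in ones.
spark.experimental.extraOptimizations = Seq(RemoveAddZero)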
Questions tagged [catalyst-optimizer]
27 questions
6
votes
0 answers
Parquet filter pushdown is not working with Spark Dataset API
Here is the sample code which I am running.
Creating a test Parquet Dataset with the mod column as the partition key.
scala> val test = spark.range(0, 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod:…

Kaushal
- 3,237
- 3
- 29
- 48
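A detail worth checking with this question: typed lambda filters on a Dataset are opaque to Catalyst, while Column expressions are not, so only the latter can be pushed down to Parquet. A minimal sketch of the contrast (the path is hypothetical; spark-shell implicits are assumed):

// Lambda filter: a black box to Catalyst, so nothing reaches the Parquet reader.
val typed = spark.read.parquet("/tmp/test").as[(Long, Long)].filter(_._2 == 10)

// Column expression: stays in the plan and shows up under PushedFilters in explain().
val untyped = spark.read.parquet("/tmp/test").filter($"mod" === 10)
untyped.explain(true)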
5
votes
1 answer
For "iterative algorithms," what is the advantage of converting to an RDD then back to a Dataframe
I am reading High Performance Spark and the author makes the following claim:
While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the…

Allen Han
- 1,163
- 7
- 16
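The round trip the book describes works because re-creating a DataFrame from its RDD gives Catalyst a fresh logical plan to analyze. A one-line sketch (bigDf is a hypothetical DataFrame with a very long lineage):

// The new plan is just a scan of the RDD, so the optimizer no longer walks the huge original tree.
val trimmed = spark.createDataFrame(bigDf.rdd, bigDf.schema)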
4
votes
1 answer
What is the role of Catalyst optimizer and Project Tungsten
I am unclear on the roles of the Catalyst optimizer and Project Tungsten.
My understanding is that the Catalyst optimizer produces an optimized physical plan from the logical plan, and the optimized physical plan is then taken by the code generator to emit RDDs.
Is…

Surendiran Balasubramanian
- 59
- 1
- 3
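A quick way to see where each component sits is explain(true), which prints the Catalyst phases in order and then the physical plan, in which Tungsten's whole-stage code generation marks fused operators with an asterisk:

// Prints the Parsed, Analyzed, and Optimized Logical Plans, then the Physical Plan
// (WholeStageCodegen stages are prefixed with "*").
spark.range(10).selectExpr("id + 1 AS inc").explain(true)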
3
votes
1 answer
Rewrite LogicalPlan to push down udf from aggregate
I have defined a UDF named "inc" which increases the input value by one; this is the code of my UDF
spark.udf.register("inc", (x: Long) => x + 1)
this is my test SQL
val df = spark.sql("select sum(inc(vals)) from…

adream307
- 95
- 1
- 7
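Before attempting a rewrite rule, it helps to see where the analyzer leaves the UDF in the optimized tree (a sketch, assuming df is the DataFrame from the truncated query above):

// numberedTreeString prints the plan one node per line, making the ScalaUDF
// inside the Aggregate easy to spot.
println(df.queryExecution.optimizedPlan.numberedTreeString)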
3
votes
0 answers
How to create custom Spark-Native functions without forking/modifying Spark itself
I am looking into converting some UDFs/UDAFs to Spark-Native functions to leverage Catalyst and codegen.
Looking through some examples (for example: https://github.com/apache/spark/pull/7214/files for Levenshtein) it seems like we need to add…

cozos
- 787
- 10
- 19
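For reference, the general shape of such a native expression, sketched only: these are internal APIs that change between releases, the class is hypothetical, and withNewChildInternal assumes Spark 3.2+:

import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, LongType}

// Hypothetical native increment: unlike a UDF, Catalyst can see its generated code
// and fuse it into the surrounding whole-stage-compiled Java.
case class IncrementOne(child: Expression) extends UnaryExpression {
  override def dataType: DataType = LongType
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[Long] + 1L
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"($c + 1L)")
  override protected def withNewChildInternal(newChild: Expression): IncrementOne =
    copy(child = newChild)
}

Exposing it to SQL still goes through the (equally internal) FunctionRegistry, which is exactly the forking concern the question raises.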
2
votes
0 answers
How to structure large queries in spark
I have recently converted an enormous SAS DATA step program to PySpark, and I think the query is so large that the Catalyst optimizer causes an OOM error in the driver. I am able to run the query when I increase the driver memory to 256 GB, but…

Zelazny7
- 39,946
- 18
- 70
- 84
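One mitigation that does not require restructuring the whole program is checkpointing between converted steps, so the optimizer only ever analyzes a bounded slice of the plan (a sketch in Scala; PySpark has the same API, and the directory is hypothetical):

// checkpoint() materializes the data and replaces the logical plan with a scan,
// so later stages do not re-analyze everything that came before.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val stage1 = hugeDf.checkpoint()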
2
votes
1 answer
Apache Spark dataframe lineage trimming via RDD and role of cache
There is the following trick to trim Apache Spark DataFrame lineage, especially for iterative computations:
def getCachedDataFrame(df: DataFrame): DataFrame = {
  val rdd = df.rdd.cache()
  df.sqlContext.createDataFrame(rdd, df.schema)
}
It…

alexanoid
- 24,051
- 54
- 210
- 410
2
votes
3 answers
Dataframe API vs Spark.sql
Does writing the code with the DataFrame API rather than spark.sql queries have any significant advantage?
I would like to know whether the Catalyst optimizer also works on spark.sql queries.

Vijaya Bhaskar
- 43
- 5
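Both front ends compile to the same Catalyst logical plans, so the optimizer applies equally to either; an easy way to check (spark-shell implicits assumed):

val viaSql = spark.sql("SELECT id FROM range(10) WHERE id > 5")
val viaApi = spark.range(10).where($"id" > 5)
viaSql.explain()  // the two physical plans should be identical
viaApi.explain()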
2
votes
1 answer
Which optimizations do UDFs not benefit from?
Spark UDFs carry the following properties: nullable, deterministic, dataType, etc. According to this information, a UDF should benefit from optimizations such as ConstantFolding. Which other optimizations does it benefit from, and which…

abden003
- 1,325
- 7
- 24
- 48
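The flags the question refers to sit on the public UDF API, where they can also be changed; a small sketch:

import org.apache.spark.sql.functions.udf

val incr = udf((x: Long) => x + 1)  // deterministic by default
val noise = udf(() => scala.util.Random.nextDouble()).asNondeterministic()  // opts out of reorderings and deduplication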
1
vote
1 answer
Why would finding an aggregate of a partition column in Spark 3 take very long time?
I'm trying to query the MIN(dt) in a table partitioned by the dt column, using the following query in both Spark 2 and Spark 3:
SELECT MIN(dt) FROM table_name
The table is stored in Parquet format in S3, where each dt value is a separate folder, so this seems…

RyanCheu
- 3,522
- 5
- 38
- 47
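Whatever the optimizer does, a common workaround is to answer this from the metastore rather than the files (a sketch; it assumes a Hive-style table whose partition values come back as strings like dt=2021-01-01):

import spark.implicits._

// SHOW PARTITIONS returns one row per partition; no data files are touched.
val minDt = spark.sql("SHOW PARTITIONS table_name")
  .as[String]
  .map(_.stripPrefix("dt="))
  .orderBy("value")  // a Dataset[String] exposes its single column as "value"
  .first()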
1
vote
0 answers
How do you inspect candidate logical plans of cost-based SQL optimizer in spark (scala)?
For a project, I want to find a way to select the top-K resolved logical plans given a SQL query in spark, based on a cost-based optimizer. Is anyone aware of a spark SQL cost-based optimizer that computes some candidate plans where I could choose…

wpunter
- 11
- 1
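Spark does not surface the candidate plans it considers, but the cost inputs the CBO works from are inspectable (a sketch; the table name is hypothetical and FOR ALL COLUMNS assumes Spark 3.x):

spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR ALL COLUMNS")

// stats carries sizeInBytes and, with the CBO on, rowCount and column statistics.
println(spark.table("t").queryExecution.optimizedPlan.stats)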
1
vote
0 answers
Is it possible to avoid a second exchange when Spark joins two datasets using joinWith?
For the following snippet of code:
case class SomeRow(key: String, value: String)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val ds1 = Seq(SomeRow("A", "1")).toDS().repartition(col("key"))
val ds2 = Seq(SomeRow("A", "1"),…

dpolaczanski
- 386
- 1
- 3
- 18
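Whatever the answer, the extra shuffle is easy to confirm from the plan (a sketch continuing the question's ds1/ds2):

// If the explicit repartitioning were reused, only those two exchanges would appear;
// an additional "Exchange hashpartitioning" operator means a second shuffle.
ds1.joinWith(ds2, ds1("key") === ds2("key")).explain()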
1
vote
0 answers
Apache Spark What is the difference between requiredChildDistribution and outputPartitioning?
In Apache Spark, each physical operator in the physical plan has 4 properties:
outputPartitioning
outputOrdering
requiredChildDistribution
requiredChildOrdering
But aren't outputPartitioning and requiredChildDistribution the same? How are they…

Arjunlal M.A
- 119
- 6
1
vote
0 answers
Is it possible to outperform the Catalyst optimizer on highly skewed data using only RDDs
I am reading High Performance Spark and the author introduces a technique that can be used to perform joins on highly skewed data by selectively filtering the data to build a HashMap with the data containing the most common keys. This HashMap is…

Allen Han
- 1,163
- 7
- 16
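In outline, the technique looks like this with plain RDDs (a compressed sketch: left: RDD[(K, V)] and right: RDD[(K, W)] are hypothetical pair RDDs, and the hot keys are assumed unique on the right side):

// 1. Find the most frequent join keys on the skewed side
//    (countByValue collects distinct keys to the driver; in practice you would sample).
val hotKeys = left.map(_._1).countByValue().toSeq.sortBy(-_._2).take(10).map(_._1).toSet

// 2. Broadcast the matching right-side rows and join them map-side, with no shuffle.
val hotMap = sc.broadcast(right.filter { case (k, _) => hotKeys(k) }.collectAsMap())
val hotJoined = left.filter { case (k, _) => hotKeys(k) }
  .flatMap { case (k, v) => hotMap.value.get(k).map(w => (k, (v, w))) }

// 3. Shuffle-join only the long tail, then stitch the two halves together.
val tail = left.filter { case (k, _) => !hotKeys(k) }
  .join(right.filter { case (k, _) => !hotKeys(k) })
val result = hotJoined.union(tail)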
1
vote
1 answer
steps in spark physical plan not assigned to DAG step
I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan that Spark has generated (with…

user276537
- 11
- 3