The Catalyst optimizer makes use of standard features of the Scala language, such as pattern matching. At its core, Catalyst contains a general tree representation and a set of rules to manipulate those trees, together with libraries specific to relational query processing. Different rule sets handle the different phases of query execution: analysis, logical optimization, physical planning, and code generation that compiles parts of queries to Java bytecode.
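To make the tree-plus-rules idea concrete, here is a minimal sketch of a custom rule added through the public experimental hook (the rule and its name are hypothetical, spark is a spark-shell session, and the three-argument Add pattern assumes Spark 3.x):

import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.IntegerType

// Hypothetical rule: rewrite `expr + 0` to `expr` by pattern matching on the expression tree.
object RemoveAddZero extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case Add(left, Literal(0, IntegerType), _) => left
  }
}

// Extra rules run as an additional optimizer batch after the built-in ones.
spark.experimental.extraOptimizations = Seq(RemoveAddZero)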
Questions tagged [catalyst-optimizer]
27 questions
6
votes
0 answers
Parquet filter pushdown is not working with Spark Dataset API
Here is the sample code which I am running.
Creating a test Parquet Dataset with the mod column as the partition key.
scala> val test = spark.range(0, 100000000).withColumn("mod", $"id".mod(40))
test: org.apache.spark.sql.DataFrame = [id: bigint, mod:…

Kaushal
- 3,237
- 3
- 29
- 48
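A detail worth checking with this question: typed lambda filters on a Dataset are opaque to Catalyst, while Column expressions are not, so only the latter can be pushed down to Parquet. A minimal sketch of the contrast (the path is hypothetical; spark-shell implicits are assumed):

// Lambda filter: a black box to Catalyst, so nothing reaches the Parquet reader.
val typed = spark.read.parquet("/tmp/test").as[(Long, Long)].filter(_._2 == 10)

// Column expression: stays in the plan and shows up under PushedFilters in explain().
val untyped = spark.read.parquet("/tmp/test").filter($"mod" === 10)
untyped.explain(true)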
5
votes
1 answer
For "iterative algorithms," what is the advantage of converting to an RDD then back to a Dataframe
I am reading High Performance Spark and the author makes the following claim:
While the Catalyst optimizer is quite powerful, one of the cases where it currently runs into challenges is with very large query plans. These query plans tend to be the…

Allen Han
- 1,163
- 7
- 16
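The round trip the book describes works because re-creating a DataFrame from its RDD gives Catalyst a fresh logical plan to analyze. A one-line sketch (bigDf is a hypothetical DataFrame with a very long lineage):

// The new plan is just a scan of the RDD, so the optimizer no longer walks the huge original tree.
val trimmed = spark.createDataFrame(bigDf.rdd, bigDf.schema)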
4
votes
1 answer
What is the role of Catalyst optimizer and Project Tungsten
I am unclear on the roles of the Catalyst optimizer and Project Tungsten.
My understanding is that the Catalyst optimizer produces an optimized physical plan from the logical plan, and the optimized physical plan is then taken by the code generator to emit RDDs.
Is…

Surendiran Balasubramanian
- 59
- 1
- 3
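A quick way to see where each component sits is explain(true), which prints the Catalyst phases in order and then the physical plan, in which Tungsten's whole-stage code generation marks fused operators with an asterisk:

// Prints the Parsed, Analyzed, and Optimized Logical Plans, then the Physical Plan
// (WholeStageCodegen stages are prefixed with "*").
spark.range(10).selectExpr("id + 1 AS inc").explain(true)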
3
votes
1 answer
Rewrite LogicalPlan to push down udf from aggregate
I have defined a UDF named "inc" which increases the input value by one; this is the code of my UDF
spark.udf.register("inc", (x: Long) => x + 1)
this is my test SQL
val df = spark.sql("select sum(inc(vals)) from…

adream307
- 95
- 1
- 7
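Before attempting a rewrite rule, it helps to see where the analyzer leaves the UDF in the optimized tree (a sketch, assuming df is the DataFrame from the truncated query above):

// numberedTreeString prints the plan one node per line, making the ScalaUDF
// inside the Aggregate easy to spot.
println(df.queryExecution.optimizedPlan.numberedTreeString)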
3
votes
0 answers
How to create custom Spark-Native functions without forking/modifying Spark itself
I am looking into converting some UDFs/UDAFs to Spark-Native functions to leverage Catalyst and codegen.
Looking through some examples (for example: https://github.com/apache/spark/pull/7214/files for Levenshtein) it seems like we need to add…

cozos
- 787
- 10
- 19
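For reference, the general shape of such a native expression, sketched only: these are internal APIs that change between releases, the class is hypothetical, and withNewChildInternal assumes Spark 3.2+:

import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{DataType, LongType}

// Hypothetical native increment: unlike a UDF, Catalyst can see its generated code
// and fuse it into the surrounding whole-stage-compiled Java.
case class IncrementOne(child: Expression) extends UnaryExpression {
  override def dataType: DataType = LongType
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[Long] + 1L
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"($c + 1L)")
  override protected def withNewChildInternal(newChild: Expression): IncrementOne =
    copy(child = newChild)
}

Exposing it to SQL still goes through the (equally internal) FunctionRegistry, which is exactly the forking concern the question raises.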
2
votes
0 answers
How to structure large queries in spark
I have recently converted an enormous SAS DATA step program to PySpark, and I think the query is so large that the Catalyst optimizer causes an OOM error in the driver. I am able to run the query when I increase the driver memory to 256 GB, but…

Zelazny7
- 39,946
- 18
- 70
- 84
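One mitigation that does not require restructuring the whole program is checkpointing between converted steps, so the optimizer only ever analyzes a bounded slice of the plan (a sketch in Scala; PySpark has the same API, and the directory is hypothetical):

// checkpoint() materializes the data and replaces the logical plan with a scan,
// so later stages do not re-analyze everything that came before.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
val stage1 = hugeDf.checkpoint()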
2
votes
1 answer
Apache Spark dataframe lineage trimming via RDD and role of cache
There is the following trick to trim Apache Spark DataFrame lineage, especially for iterative computations:
def getCachedDataFrame(df: DataFrame): DataFrame = {
  val rdd = df.rdd.cache()
  df.sqlContext.createDataFrame(rdd, df.schema)
}
It…

alexanoid
- 24,051
- 54
- 210
- 410
2
votes
3 answers
Dataframe API vs Spark.sql
Does writing the code with the DataFrame API rather than spark.sql queries have any significant advantage?
I would like to know whether the Catalyst optimizer also works on spark.sql queries.

Vijaya Bhaskar
- 43
- 5
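Both front ends compile to the same Catalyst logical plans, so the optimizer applies equally to either; an easy way to check (spark-shell implicits assumed):

val viaSql = spark.sql("SELECT id FROM range(10) WHERE id > 5")
val viaApi = spark.range(10).where($"id" > 5)
viaSql.explain()  // the two physical plans should be identical
viaApi.explain()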
2
votes
1 answer
Which optimizations do UDFs not benefit from?
Spark UDFs carry the following properties: nullable, deterministic, dataType, etc. According to this information, a UDF should benefit from optimizations such as ConstantFolding. Which other optimizations does it benefit from, and which…

abden003
- 1,325
- 7
- 24
- 48
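The flags the question refers to sit on the public UDF API, where they can also be changed; a small sketch:

import org.apache.spark.sql.functions.udf

val incr = udf((x: Long) => x + 1)  // deterministic by default
val noise = udf(() => scala.util.Random.nextDouble()).asNondeterministic()  // opts out of reorderings and deduplication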
1
vote
1 answer
Why would finding an aggregate of a partition column in Spark 3 take very long time?
I'm trying to query the MIN(dt) in a table partitioned by the dt column, using the following query in both Spark 2 and Spark 3:
SELECT MIN(dt) FROM table_name
The table is stored in Parquet format in S3, where each dt value is a separate folder, so this seems…

RyanCheu
- 3,522
- 5
- 38
- 47
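Whatever the optimizer does, a common workaround is to answer this from the metastore rather than the files (a sketch; it assumes a Hive-style table whose partition values come back as strings like dt=2021-01-01):

import spark.implicits._

// SHOW PARTITIONS returns one row per partition; no data files are touched.
val minDt = spark.sql("SHOW PARTITIONS table_name")
  .as[String]
  .map(_.stripPrefix("dt="))
  .orderBy("value")  // a Dataset[String] exposes its single column as "value"
  .first()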
1
vote
0 answers
How do you inspect candidate logical plans of cost-based SQL optimizer in spark (scala)?
For a project, I want to find a way to select the top-K resolved logical plans given a SQL query in spark, based on a cost-based optimizer. Is anyone aware of a spark SQL cost-based optimizer that computes some candidate plans where I could choose…

wpunter
- 11
- 1
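Spark does not surface the candidate plans it considers, but the cost inputs the CBO works from are inspectable (a sketch; the table name is hypothetical and FOR ALL COLUMNS assumes Spark 3.x):

spark.sql("SET spark.sql.cbo.enabled=true")
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR ALL COLUMNS")

// stats carries sizeInBytes and, with the CBO on, rowCount and column statistics.
println(spark.table("t").queryExecution.optimizedPlan.stats)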
1
vote
0 answers
Is it possible to avoid a second exchange when Spark joins two datasets using joinWith?
For the following snippet of code:
case class SomeRow(key: String, value: String)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
val ds1 = Seq(SomeRow("A", "1")).toDS().repartition(col("key"))
val ds2 = Seq(SomeRow("A", "1"),…

dpolaczanski
- 386
- 1
- 3
- 18
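Whatever the answer, the extra shuffle is easy to confirm from the plan (a sketch continuing the question's ds1/ds2):

// If the explicit repartitioning were reused, only those two exchanges would appear;
// an additional "Exchange hashpartitioning" operator means a second shuffle.
ds1.joinWith(ds2, ds1("key") === ds2("key")).explain()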
1
vote
0 answers
Apache Spark What is the difference between requiredChildDistribution and outputPartitioning?
In Apache Spark, each physical operator in the physical plan has 4 properties:
outputPartitioning
outputOrdering
requiredChildDistribution
requiredChildOrdering
But aren't outputPartitioning and requiredChildDistribution the same? How are they…

Arjunlal M.A
- 119
- 6
1
vote
0 answers
Is it possible to outperform the Catalyst optimizer on highly skewed data using only RDDs
I am reading High Performance Spark and the author introduces a technique that can be used to perform joins on highly skewed data by selectively filtering the data to build a HashMap with the data containing the most common keys. This HashMap is…

Allen Han
- 1,163
- 7
- 16
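In outline, the technique looks like this with plain RDDs (a compressed sketch: left: RDD[(K, V)] and right: RDD[(K, W)] are hypothetical pair RDDs, and the hot keys are assumed unique on the right side):

// 1. Find the most frequent join keys on the skewed side
//    (countByValue collects distinct keys to the driver; in practice you would sample).
val hotKeys = left.map(_._1).countByValue().toSeq.sortBy(-_._2).take(10).map(_._1).toSet

// 2. Broadcast the matching right-side rows and join them map-side, with no shuffle.
val hotMap = sc.broadcast(right.filter { case (k, _) => hotKeys(k) }.collectAsMap())
val hotJoined = left.filter { case (k, _) => hotKeys(k) }
  .flatMap { case (k, v) => hotMap.value.get(k).map(w => (k, (v, w))) }

// 3. Shuffle-join only the long tail, then stitch the two halves together.
val tail = left.filter { case (k, _) => !hotKeys(k) }
  .join(right.filter { case (k, _) => !hotKeys(k) })
val result = hotJoined.union(tail)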
1
vote
1 answer
steps in spark physical plan not assigned to DAG step
I am trying to debug a simple query in Spark SQL that is returning incorrect data.
In this instance the query is a simple join between two Hive tables.
The issue seems tied to the fact that the physical plan that Spark has generated (with…

user276537
- 11
- 3