
I am running a simple query in two versions of Spark, 2.3 and 3.2. The code is below:

spark-shell --master yarn --deploy-mode client
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()

In Spark 2.3 it returns

+----+
| id |
+----+
| 1  |
| 1  |
+----+

But in Spark 3.2 it returns

org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.;
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:213)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:97)

I was expecting both versions to produce the same result, or at least a configuration to make the behavior consistent. Setting the following does not change the behavior:

spark.sql.analyzer.failAmbiguousSelfJoin=false
spark.sql.caseSensitive=false

On top of this, when both occurrences of the column use the same case, it works:

val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols.head, op_cols.tail: _*)
df2.select("id").show()

Further analysis points out that this behavior was introduced in 2.4: the same query fails in Spark 2.4 as well.

ASR

2 Answers


By default, Spark is not case sensitive. In Spark 3.x, with the following option activated, it works the same way as in Spark 2.3.

spark.conf.set("spark.sql.caseSensitive", "true")
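
If the flag needs to be active from the very beginning of the job (see the comments below about the problems of toggling it mid-job), it can be passed when the session is created instead of being set afterwards. A minimal sketch; the application name is made up for illustration:

// Passed on the command line, so the setting applies to the whole session:
// spark-shell --conf spark.sql.caseSensitive=true

// Or set when building the session programmatically:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("case-sensitivity-demo")  // hypothetical name
  .config("spark.sql.caseSensitive", "true")
  .getOrCreate()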

I tried to dig a little deeper into the difference in behavior between 2.3 and 3.2. I found a simpler example that reproduces the problem. In Spark 2.3, without case sensitivity (the default), this does not fail:

spark.range(1).select("id", "ID").select("id").explain
== Physical Plan ==
*(1) Range (0, 1, step=1, splits=4)

We see that Spark simplifies the select so that it does not have to deal with the ambiguity.

In 3.x however, it fails. I tried setting spark.sql.analyzer.failAmbiguousSelfJoin to false, since it has defaulted to true as of 3.0 (https://spark.apache.org/docs/latest/sql-migration-guide.html), but that does not change the result.
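
For reference, here is what the simpler example does in 3.2 with the default case-insensitive resolution; the error matches the one shown in the question:

spark.range(1).select("id", "ID").select("id").explain()
// org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: id, id.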

Oli
  • When I set spark.conf.set("spark.sql.caseSensitive", "true") in 3.2, it gives an error while executing `val df2 = df1.select(op_cols.head, op_cols.tail: _*)`; when set to false, this line executes and then fails on the next command (the select). – ASR Feb 06 '23 at 09:22
  • Just an update: if we set the parameter (the caseSensitive one) just before the last select, it does work, but that would be the wrong place to insert the setting; it should be set from the beginning. Dynamically switching the setting would be ugly, recurring code. – ASR Feb 06 '23 at 11:28
  • Your dataframe contains these columns `["id","col2","col3","col4", "col5"]`. When you do `val df2 = df1.select(op_cols.head, op_cols.tail: _*)`, you try to select the column `ID`. With `spark.sql.caseSensitive` set to false, Spark allows it since `id` exists and case is not taken into account. With `spark.sql.caseSensitive` set to true, Spark tells you right away that it cannot find the `ID` column since only `id` exists. – Oli Feb 06 '23 at 13:42
  • I understand that part; my question was why the two versions behave differently. In 2.3, without setting case sensitivity (default false), all four commands succeed; in 3.2 it fails. – ASR Feb 06 '23 at 14:52
  • I looked into it and for now, the only thing I can say is that something has changed between the two versions :D I even asked ChatGPT about it and it has no clue either! – Oli Feb 06 '23 at 23:22

The error was introduced in Spark 2.4, when new attribute-resolution code was added for expressions. In Spark 2.3 the resolver applied `distinct` to the candidates, but the later code that builds `candidates`/`prunedCandidates` did not apply `distinct`. Once `distinct` is added back while resolving attributes for the plan, the behavior is the same as in 2.3.
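
Until you are on a release that includes the fix, a user-side workaround in the same spirit (my own sketch, not code from the PR) is to deduplicate the column list case-insensitively before the select; with no duplicate column left, the final select is unambiguous in both 2.3 and 3.2:

// Hypothetical helper, not part of the original question: keep only the
// first occurrence of each column name, comparing case-insensitively to
// mirror the `distinct` that the fix applies during attribute resolution.
def dedupCols(cols: Seq[String]): Seq[String] = {
  val seen = scala.collection.mutable.Set.empty[String]
  cols.filter(c => seen.add(c.toLowerCase))
}

val deduped = dedupCols(op_cols)  // List(id, col2, col3, col4, col5)
val df2 = df1.select(deduped.head, deduped.tail: _*)
df2.select("id").show()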

The PR for this fix has been merged into the Spark 3.4 branch. See: https://github.com/apache/spark/pull/40258

ASR