I am trying to join the following 2 dataframes:

val df1 = Seq(
  ("Verizon", "USA"),
  ("AT & T", "PK"),
  ("Verizon", "IND")
).toDF("Brand", "Country")

val df2 = Seq(
  (8, "USA"),
  (64, "UK"),
  (-27, "DE")
).toDF("TS", "Country")

If I join them like this, it works:

df1.join(df2, Seq("Country")).count

But when I use withColumn() and lit() before the join (to replace the column values), it throws an exception:

df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count

Exception:

org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LocalRelation
and
Project
+- Filter (isnotnull(_2#680) && (USA = _2#680))
   +- LocalRelation [_1#679, _2#680]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
  at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
   ...

It also works when I use crossJoin:

df1.withColumn("Country", lit("USA")).crossJoin(df2.filter(col("Country") === "USA"))

But I don't understand why it doesn't work with a simple join. Why do I need to use a cross join to make it work? Any help would be appreciated. Thanks.

Waqar Ahmed
    Possible duplicate of [Detected cartesian product for INNER join on literal column in PySpark](https://stackoverflow.com/questions/53449100/detected-cartesian-product-for-inner-join-on-literal-column-in-pyspark) – 10465355 Jan 21 '19 at 18:52
  • That one is related to sort merge join. It isn't. – Waqar Ahmed Jan 21 '19 at 18:56
  • Setting `sql("set spark.sql.crossJoin.enabled=true")` before doing the join worked for me. But don't know why it'll work. – Md Shihab Uddin Jan 21 '19 at 19:33

1 Answer

The Spark analyzer detected a cross-join condition even though you intended an inner join.

Because a cross join is costly, the default behavior is to throw an exception whenever the physical plan detects a cross join that the query did not request explicitly.

This happens because the join column was replaced by one formed from a literal: after `withColumn("Country", lit("USA"))`, every row of df1 has the same join key, so the equi-join condition is trivially satisfied and the optimizer flags the plan as an implicit cartesian product.

The cross-join behavior is explained in more detail in the thread that user 10465355 linked above.
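If you accept that the result really is a cartesian product of df1 with the matching rows of df2, you can either use the explicit `crossJoin` from the question, or enable the configuration flag the exception suggests. A minimal sketch (assuming a local SparkSession; the app name and `main` wrapper are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object LitJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lit-join")
      // Allow implicit cartesian products: the literal key makes the
      // equi-join condition trivial, which is what the analyzer rejects.
      .config("spark.sql.crossJoin.enabled", "true")
      .getOrCreate()
    import spark.implicits._

    val df1 = Seq(("Verizon", "USA"), ("AT & T", "PK"), ("Verizon", "IND"))
      .toDF("Brand", "Country")
    val df2 = Seq((8, "USA"), (64, "UK"), (-27, "DE"))
      .toDF("TS", "Country")

    // With the flag enabled, the original query runs: each of the 3 rows
    // of df1 matches the single "USA" row of df2, so the count is 3.
    val n = df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count()
    println(n)

    spark.stop()
  }
}
```

Note that this is semantically identical to `df1.withColumn("Country", lit("USA")).crossJoin(df2.filter(col("Country") === "USA"))` from the question; there is no cheaper "simple join" hiding here, because a constant join key means every row pair with that key must be compared.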

Ramdev Sharma
  • That is my question: why am I getting the cross join exception? What is the reason for that? – Waqar Ahmed Jan 22 '19 at 09:23
  • This is happening because the column was replaced by one formed from the literal 'USA'. So if you do an inner join using a column formed this way, you will hit this exception. This is a simple rule validation in the analyzer. – Ramdev Sharma Jan 22 '19 at 12:20
  • So you mean to say, dataframe with lit column doesn't work with join? If yes, then what is the correct way to have join with lit without using cross join as it is expensive operation. – Waqar Ahmed Jan 22 '19 at 12:39
  • It is expensive, so if you still want to do it, use the crossJoin syntax. When our use case requires a cross join, we use crossJoin so the analyzer knows the cross join is intended and will not complain. – Ramdev Sharma Jan 22 '19 at 12:42
  • But I don't want to use cross join. Is there any otherway to solve this problem? – Waqar Ahmed Jan 22 '19 at 12:48
  • If one column has the same value and it is the key column to be joined. Does spark imagine that it is cross join? – Waqar Ahmed Jan 22 '19 at 12:50
  • Yes, it will always be a cross join, since all rows have the same value in the column used for the join. – Ramdev Sharma Jan 22 '19 at 13:00
  • Now, i got it. Thanks! – Waqar Ahmed Jan 22 '19 at 13:01