I am trying to join the following two DataFrames:
import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._ // for .toDF; spark-shell imports this automatically

val df1 = Seq(
  ("Verizon", "USA"),
  ("AT & T", "PK"),
  ("Verizon", "IND")
).toDF("Brand", "Country")
val df2 = Seq(
  (8, "USA"),
  (64, "UK"),
  (-27, "DE")
).toDF("TS", "Country")
If I join them like this, it works:
df1.join(df2, Seq("Country")).count
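(Only the "USA" rows match on Country, so this returns 1.)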
But when I use withColumn() and lit() before the join (to replace the column's values with a constant), it throws an exception:
df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count
Exception:
org.apache.spark.sql.AnalysisException: Detected implicit cartesian product for INNER join between logical plans
LocalRelation
and
Project
+- Filter (isnotnull(_2#680) && (USA = _2#680))
+- LocalRelation [_1#679, _2#680]
Join condition is missing or trivial.
Either: use the CROSS JOIN syntax to allow cartesian products between these
relations, or: enable implicit cartesian products by setting the configuration
variable spark.sql.crossJoin.enabled=true;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1124)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$21.applyOrElse(Optimizer.scala:1121)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
...
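Setting the configuration flag the message itself mentions also makes the query run (a minimal sketch, assuming the SparkSession is in scope as spark):

// enable implicit cartesian products, as the error message suggests
spark.conf.set("spark.sql.crossJoin.enabled", "true")
df1.withColumn("Country", lit("USA")).join(df2, Seq("Country")).count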
It also works when I use crossJoin:
df1.withColumn("Country", lit("USA")).crossJoin(df2.filter(col("Country") === "USA"))
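(Filtering df2 down to "USA" first keeps this cross join equivalent to the intended equi-join, since after the withColumn every row of df1 has Country = "USA" anyway.)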
But I don't understand why it doesn't work with a plain join, and why I need a cross join to make it work. Any help would be appreciated. Thanks!