
I performed the following steps:

  1. loaded a JSON file as a Spark DataFrame
  2. analyzed data from 5 columns of this DataFrame
  3. applied a function to the data extracted from these 5 columns (binned continuous values into 10 bins by percentile, although I don't think the details matter)
  4. created a new DataFrame using spark.createDataFrame, containing all of these new values under 5 completely different column names
  5. attempted a full outer join of the original DataFrame with the new DataFrame.

Because all of the columns in my synthesized dataframe have different names from the columns in the original dataframe, an outer join should be the same as simply concatenating the two dataframes along the column axis.

However, instead I receive this error:

AnalysisException: u'Detected implicit cartesian product for FULL OUTER join between logical plans\nUnion\n:- Project\n:

How do I resolve this? I simply want to concatenate the dataframes by column like in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

AAC
  • An outer join will not merge the dataframes together as you think (shuffle operations can change the row order, so you don't have any guarantee that the order is what you expect). It would be easier to avoid this situation by not creating a new dataframe and simply working with a single one, but if you really need to, then this should help: https://stackoverflow.com/questions/40508489/spark-merge-2-dataframes-by-adding-row-index-number-on-both-dataframes – Shaido Apr 02 '19 at 01:38
  • How would I do this by working with a single dataframe? – AAC Apr 02 '19 at 05:06
  • I don't know, there isn't enough information in the question to answer that. Maybe [`Bucketizer`](https://spark.apache.org/docs/latest/ml-features.html#bucketizer) can help? You could consider asking a new question regarding this with more information. – Shaido Apr 02 '19 at 05:28

1 Answer


Depending on your implementation, you will need to set:

spark.sql.crossJoin.enabled = true
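In PySpark this can be set on an existing session, for example (the `spark` session variable is illustrative, and this config only matters on Spark 2.x, where cross joins are disabled by default):

```python
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```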

Jimmy
  • But I do not want a cartesian product, I only want to concatenate the dataframes along the column axis. – AAC Apr 02 '19 at 00:45
  • Understood. If you know that the join will be 1 to 1, you shouldn't have any issues. Spark doesn't know that and can't assume it, since a many-to-many join would be an extremely expensive job. – Jimmy Apr 02 '19 at 01:51