
I performed the following steps:

  1. loaded a JSON file as a Spark DataFrame
  2. analyzed data from 5 columns of this DataFrame
  3. applied a function to the data extracted from these 5 columns (binned continuous values into 10 bins by percentile, although I don't think the details matter)
  4. created a new DataFrame using spark.createDataFrame, containing all of these new values under 5 completely different column names
  5. attempted a full outer join of the original DataFrame with the new DataFrame.

Because all of the columns in my synthesized dataframe have different names from the columns in the original dataframe, an outer join should be the same as simply concatenating the two dataframes along the column axis.

However, instead I receive this error:

AnalysisException: u'Detected implicit cartesian product for FULL OUTER join between logical plans\nUnion\n:- Project\n:

How do I resolve this? I simply want to concatenate the dataframes by column like in https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

AAC
  • An outer join will not merge the dataframes together as you think (shuffle operations can change the row order, so you don't have any guarantee that the order is what you expect). It would be easier to avoid this situation by not creating a new dataframe and simply working with a single one, but if you really need to, then this should help: https://stackoverflow.com/questions/40508489/spark-merge-2-dataframes-by-adding-row-index-number-on-both-dataframes – Shaido Apr 02 '19 at 01:38
  • How would I do this by working with a single dataframe? – AAC Apr 02 '19 at 05:06
  • I don't know, there isn't enough information in the question to answer that. Maybe [`Bucketizer`](https://spark.apache.org/docs/latest/ml-features.html#bucketizer) can help? You could consider asking a new question regarding this with more information. – Shaido Apr 02 '19 at 05:28

1 Answer


Depending on your implementation, you will need to set:

spark.sql.crossJoin.enabled = true
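In PySpark this can be set on an existing session, for example (the `spark` session variable is illustrative, and this config only matters on Spark 2.x, where cross joins are disabled by default):

```python
spark.conf.set("spark.sql.crossJoin.enabled", "true")
```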

Jimmy
  • But I do not want a cartesian product, I only want to concatenate the dataframes along the column axis. – AAC Apr 02 '19 at 00:45
  • Understood. If you know that the join will be 1 to 1, you shouldn't have any issues. Spark doesn't know that and can't assume it, since a many-to-many join would be an extremely expensive job. – Jimmy Apr 02 '19 at 01:51