I am trying to join two DataFrames with each other after performing some earlier computation. The command is simple:

employee.join(employer, employee("id") === employer("id"))

However, the join seems to perform a Cartesian join, completely ignoring my === condition. Does anyone have an idea why this is happening?

blackbishop
NNamed
  • Welcome to SO NNamed. If you're asking for help you should give us a chance :) Good place to start is to provide [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). – zero323 Aug 24 '15 at 20:23

1 Answer

I think I ran into the same issue. Check whether you get this warning after creating the join operation:

Constructing trivially true equals predicate [..]

If so, just alias one of the columns in either the employee or the employer DataFrame, e.g. like this:

employee.select(<columns you want>, employee("id").as("id_e"))

Then perform join on employee("id_e") === employer("id").
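
To make the workaround above concrete, here is a minimal sketch. It assumes Spark 2.x or later (for SparkSession) running in local mode; the sample rows, the `dept` column, and the way `employer` is derived from `employee` are hypothetical and only stand in for the "earlier computation" from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AliasColumnJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-column-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, not from the original question.
    val employee = Seq((1, "Ann", 10), (2, "Bob", 20), (3, "Cid", 10))
      .toDF("id", "name", "dept")

    // employer is computed from employee, so its "id" is the very same column.
    val employer = employee
      .filter(col("dept") === 10)
      .select(col("id"), col("name").as("employer_name"))

    // Alias the id column on one side so it becomes a distinct column.
    val employeeAliased = employee.select(col("name"), col("id").as("id_e"))

    // Join on the aliased column instead of employee("id").
    val joined = employeeAliased.join(
      employer,
      employeeAliased("id_e") === employer("id"))

    joined.show()
    spark.stop()
  }
}
```

With the alias in place, the two sides of === are no longer the same expression, so the condition should be applied as an ordinary equi-join condition.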

Explanation. Consider an operation flow in which DataFrame B is computed from DataFrame A and the two are then joined:

If you use DataFrame A directly to compute DataFrame B and then join them on the column Id, which comes from DataFrame A, you will not get the join you intended. The Id column of DataFrame B is in fact exactly the same column as the one in DataFrame A, so Spark just asserts that the column is equal to itself, hence the trivially true predicate. To avoid this, you have to alias one of the columns so that they appear as "different" columns to Spark. For now, only a warning message has been implemented, in this way:

    def === (other: Any): Column = {
      val right = lit(other).expr
      if (this.expr == right) {
        logWarning(
          s"Constructing trivially true equals predicate, '${this.expr} = $right'. " +
            "Perhaps you need to use aliases.")
      }
      EqualTo(expr, right)
    }

It is not a very good solution for me (it is really easy to miss the warning message); I hope this will be fixed somehow.

You are lucky to see the warning message at all, though: it was added not so long ago ;).
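
The warning's hint ("Perhaps you need to use aliases") can also be read as aliasing the DataFrames themselves rather than a single column. The following is a rough sketch of that variant, not something taken from the original answer; it assumes Spark 2.x or later, and the data, the `dept` column, and the alias names "e" and "r" are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AliasDataFrameJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("alias-dataframe-join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data, not from the original question.
    val employee = Seq((1, "Ann", 10), (2, "Bob", 20), (3, "Cid", 10))
      .toDF("id", "name", "dept")

    // employer is derived from employee and keeps the same columns.
    val employer = employee.filter(col("dept") === 10)

    // Give each side its own alias and refer to columns through the alias.
    val e = employee.as("e")
    val r = employer.as("r")

    val joined = e
      .join(r, col("e.id") === col("r.id"))
      .select(col("e.id"), col("e.name"), col("r.dept"))

    joined.show()
    spark.stop()
  }
}
```

Here the join condition is resolved through the qualifiers "e" and "r", so each side of === refers to a different side of the join even though both DataFrames share the same origin.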

TheMP
  • I was struggling for most of a weekend trying to fix join issues in 1.5.2 -- this was one of the two issues and your answer saved a lot of frustration. Thank you! – Pyrce Apr 11 '16 at 04:56
  • Had to go through the same frustration until I started digging into Spark code ;). – TheMP Apr 13 '16 at 07:58
  • Well, its not fixed even today (2020). Had the same issue. Thanks Niemand. Saved a lot of my time. – Abira Aug 10 '20 at 13:37
  • 2023: this behaviour still observed, spent 3 days fixing it while came here! – Dima Naychuk May 31 '23 at 23:02