I have created a Spark dataframe by joining two dataframes on a UNIQUE_ID, using the following code:

ddf_A.join(ddf_B, ddf_A.UNIQUE_ID_A == ddf_B.UNIQUE_ID_B, how='inner').limit(5).toPandas()

The UNIQUE_ID (dtype = 'int') is created in the initial dataframe by using the following code:

row_number().over(Window.orderBy(lit(1)))
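
In full, with imports, that statement is roughly the following (the dataframe name df is an assumption):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, lit

# assign a sequential ID; ordering by lit(1) imposes an arbitrary,
# non-deterministic row order over the whole dataframe
df = df.withColumn('UNIQUE_ID', row_number().over(Window.orderBy(lit(1))))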

Both ddf_A and ddf_B are created as subsets of the initial dataframe via inner joins with two additional tables. In both subsets the UNIQUE_ID column has been renamed, using an alias, to UNIQUE_ID_A and UNIQUE_ID_B respectively, as sketched below.
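
Schematically, the subsets are built like this (tbl_1, tbl_2 and the join keys key_1, key_2 are placeholders for the real tables and columns):

from pyspark.sql.functions import col

# join against the two additional tables, then rename the ID via an alias
ddf_A = (df.join(tbl_1, on='key_1', how='inner')
           .select('*', col('UNIQUE_ID').alias('UNIQUE_ID_A'))
           .drop('UNIQUE_ID'))
ddf_B = (df.join(tbl_2, on='key_2', how='inner')
           .select('*', col('UNIQUE_ID').alias('UNIQUE_ID_B'))
           .drop('UNIQUE_ID'))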

The result (5 rows) of the inner join between ddf_A and ddf_B looks as follows:

|----|------------------|-------------------|
|    |    UNIQUE_ID_A   |    UNIQUE_ID_B    |
|----|------------------|-------------------|
|  0 |         451123   |         451123    |
|  1 |         451149   |         451149    |
|  2 |         451159   |         451159    |
|  3 |         451345   |         451345    |
|  4 |         451487   |         451487    |
|----|------------------|-------------------|

This looks acceptable to me at first sight. However, I can't find 451123 in ddf_A with the following code:

ddf_A.filter(col('UNIQUE_ID_A') == 451123).show()

Do you have any idea what's wrong here?

Comments:
  • What's the purpose of `row_number().over(Window.orderBy(lit(1)))` - isn't that going to give an arbitrary ordering? If you're looking for assigning a unique ID, you should use [`pyspark.sql.functions.monotonically_increasing_id()`](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id) – pault May 16 '18 at 12:46
  • I used monotonically_increasing_id() first and faced the same issue, then I tried this one – Jari1995 May 16 '18 at 12:52
  • Hard to debug without a [mcve]. Do you think it's possible to break your dataframes down to a small sample in such a way that we can recreate your issue? – pault May 16 '18 at 12:54
  • [Spark join produces wrong results](https://stackoverflow.com/q/36131942/9613318) – Alper t. Turker May 16 '18 at 13:31
  • @user9613318: join on aliases, one of the solutions proposed in your link, is exactly what I'm doing, or am I missing something? – Jari1995 May 16 '18 at 13:50
  • Please try: `ddf_A.withColumn("uid", col("UNIQUE_ID_A")).join(ddf_B.withColumn("uid", col("UNIQUE_ID_B")), on="uid", how = 'inner').drop("uid").limit(5).toPandas()`. – pault May 16 '18 at 14:22
  • @pault: I just tried your code but unfortunately it doesn't work either. The same issue remains. – Jari1995 May 16 '18 at 14:53
  • I don't see any aliases in your code. – Alper t. Turker May 16 '18 at 17:07
  • @user9613318: it's mentioned in the post but I'm sure you know that – Jari1995 May 16 '18 at 20:10
  • You renamed non-join columns, not aliased tables. – Alper t. Turker May 16 '18 at 21:13
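
For reference: a likely explanation (not confirmed in this thread) is that both row_number() over an arbitrary ordering and monotonically_increasing_id() are non-deterministic, so the IDs can be recomputed with different values each time the plan is executed; the join and the later filter would then see different IDs for the same rows. A common mitigation, sketched here under that assumption, is to materialize the IDs once before deriving the subsets:

from pyspark.sql.functions import monotonically_increasing_id

# assign the ID once, then persist and materialize the dataframe so the
# non-deterministic ID values are not recomputed on each action
df = df.withColumn('UNIQUE_ID', monotonically_increasing_id())
df.persist()
df.count()  # force evaluation; downstream plans reuse the cached IDs

df.checkpoint() (with a checkpoint directory set), or writing the dataframe out and reading it back, gives a stronger guarantee, since persisted blocks can be evicted.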
