I have created a Spark dataframe by joining two dataframes on a UNIQUE_ID, using the following code:

ddf_A.join(ddf_B, ddf_A.UNIQUE_ID_A == ddf_B.UNIQUE_ID_B, how='inner').limit(5).toPandas()

The UNIQUE_ID (dtype = 'int') is created in the initial dataframe by using the following code:

row_number().over(Window.orderBy(lit(1)))
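
In full, with imports, that statement is roughly the following (the dataframe name df is an assumption):

from pyspark.sql import Window
from pyspark.sql.functions import row_number, lit

# assign a sequential ID; ordering by lit(1) imposes an arbitrary,
# non-deterministic row order over the whole dataframe
df = df.withColumn('UNIQUE_ID', row_number().over(Window.orderBy(lit(1))))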

Both ddf_A and ddf_B are created as subsets of the initial dataframe via inner joins with two additional tables. In both subsets the UNIQUE_ID column has been renamed, using an alias, to UNIQUE_ID_A and UNIQUE_ID_B respectively, as sketched below.
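
Schematically, the subsets are built like this (tbl_1, tbl_2 and the join keys key_1, key_2 are placeholders for the real tables and columns):

from pyspark.sql.functions import col

# join against the two additional tables, then rename the ID via an alias
ddf_A = (df.join(tbl_1, on='key_1', how='inner')
           .select('*', col('UNIQUE_ID').alias('UNIQUE_ID_A'))
           .drop('UNIQUE_ID'))
ddf_B = (df.join(tbl_2, on='key_2', how='inner')
           .select('*', col('UNIQUE_ID').alias('UNIQUE_ID_B'))
           .drop('UNIQUE_ID'))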

The result (5 rows) of the inner join between ddf_A and ddf_B looks as follows:

|----|------------------|-------------------|
|    |    UNIQUE_ID_A   |    UNIQUE_ID_B    |
|----|------------------|-------------------|
|  0 |         451123   |         451123    |
|  1 |         451149   |         451149    |
|  2 |         451159   |         451159    |
|  3 |         451345   |         451345    |
|  4 |         451487   |         451487    |
|----|------------------|-------------------|

This looks acceptable to me at first sight. However, I can't find 451123 in ddf_A with the following code:

ddf_A.filter(col('UNIQUE_ID_A') == 451123).show()

Do you have any idea what's wrong here?

Comments:
  • What's the purpose of `row_number().over(Window.orderBy(lit(1)))` - isn't that going to give an arbitrary ordering? If you're looking for assigning a unique ID, you should use [`pyspark.sql.functions.monotonically_increasing_id()`](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id) – pault May 16 '18 at 12:46
  • I used monotonically_increasing_id() first and faced the same issue, then I tried this one – Jari1995 May 16 '18 at 12:52
  • Hard to debug without a [mcve]. Do you think it's possible to break your dataframes down to a small sample in such a way that we can recreate your issue? – pault May 16 '18 at 12:54
  • [Spark join produces wrong results](https://stackoverflow.com/q/36131942/9613318) – Alper t. Turker May 16 '18 at 13:31
  • @user9613318: join on aliases, one of the solutions proposed in your link, is exactly what I'm doing, or am I missing something? – Jari1995 May 16 '18 at 13:50
  • Please try: `ddf_A.withColumn("uid", col("UNIQUE_ID_A")).join(ddf_B.withColumn("uid", col("UNIQUE_ID_B")), on="uid", how = 'inner').drop("uid").limit(5).toPandas()`. – pault May 16 '18 at 14:22
  • @pault: I just tried your code but unfortunately it doesn't work either. The same issue remains. – Jari1995 May 16 '18 at 14:53
  • I don't see any aliases in your code. – Alper t. Turker May 16 '18 at 17:07
  • @user9613318: it's mentioned in the post but I'm sure you know that – Jari1995 May 16 '18 at 20:10
  • You renamed non-join columns, not aliased tables. – Alper t. Turker May 16 '18 at 21:13
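
For reference: a likely explanation (not confirmed in this thread) is that both row_number() over an arbitrary ordering and monotonically_increasing_id() are non-deterministic, so the IDs can be recomputed with different values each time the plan is executed; the join and the later filter would then see different IDs for the same rows. A common mitigation, sketched here under that assumption, is to materialize the IDs once before deriving the subsets:

from pyspark.sql.functions import monotonically_increasing_id

# assign the ID once, then persist and materialize the dataframe so the
# non-deterministic ID values are not recomputed on each action
df = df.withColumn('UNIQUE_ID', monotonically_increasing_id())
df.persist()
df.count()  # force evaluation; downstream plans reuse the cached IDs

df.checkpoint() (with a checkpoint directory set), or writing the dataframe out and reading it back, gives a stronger guarantee, since persisted blocks can be evicted.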
