I have created a Spark dataframe by joining two dataframes on a UNIQUE_ID column, using the following code:
ddf_A.join(ddf_B, ddf_A.UNIQUE_ID_A == ddf_B.UNIQUE_ID_B, how = 'inner').limit(5).toPandas()
The UNIQUE_ID column (dtype = 'int') is created in the initial dataframe with the following code:
row_number().over(Window.orderBy(lit(1)))
Both ddf_A and ddf_B are created as subsets of the initial dataframe via inner joins with two additional tables. The UNIQUE_ID column has been renamed with an alias to UNIQUE_ID_A and UNIQUE_ID_B respectively.
The result (5 rows) of the inner join between ddf_A and ddf_B looks as follows:
|----|-------------|-------------|
|    | UNIQUE_ID_A | UNIQUE_ID_B |
|----|-------------|-------------|
|  0 |      451123 |      451123 |
|  1 |      451149 |      451149 |
|  2 |      451159 |      451159 |
|  3 |      451345 |      451345 |
|  4 |      451487 |      451487 |
|----|-------------|-------------|
This looks acceptable to me at first sight. However, I can't find 451123 in ddf_A with the following code:
ddf_A.filter(col('UNIQUE_ID_A') == 451123).show()
Do you have any idea what's wrong here?