Issue when joining two DataFrames that have null values in the join key in PySpark.
We have two PySpark DataFrames.
df1 :-
-----------------------------------
|ID |Name|DOB |Ht |
-----------------------------------
|101|aaa |2000-01-02 00:00:00|155 |
|102|bbb |2000-01-03 00:00:00|null|
|103|ccc |null |170 |
df2 :-
---------------------------------------
|ID |Age|DOB |Name|Ht |
---------------------------------------
|101|31 |2000-01-02 00:00:00|aaa |155 |
|102|32 |2000-01-03 00:00:00|bbb |null|
|103|33 |null |ccc |170 |
Case 1: We need to full join df1 and df2 on the keys 'ID' and 'Ht'. The code we used is shown below.
The columns that have the same name in both DataFrames are renamed before joining.
df_2 = df2
rename_vars = ['Name', 'DOB']
rename_vars1 = ['Name_1', 'DOB_1']
# Rename the overlapping non-key columns in df2 before joining
for colname, newcolname in zip(rename_vars, rename_vars1):
    df_2 = df_2.withColumnRenamed(colname, newcolname)
merge2 = df1.join(df_2, on=['ID', 'Ht'], how='full')
merge2.printSchema()
merge2.show()
The output is as shown below.
merge2 :-
-------------------------------------------------------------------
|ID |Ht |Name|DOB |Age |DOB_1 |Name_1|
-------------------------------------------------------------------
|101|155 |aaa |2000-01-02 00:00:00|31 |2000-01-02 00:00:00|aaa |
|102|null|bbb |2000-01-03 00:00:00|null|null |null |
|102|null|null|null |32 |2000-01-03 00:00:00|bbb |
|103|170 |ccc |null |33 |null |ccc |
We expect only one record for ID = '102'. Here, since the value of 'Ht' is null, PySpark does not treat the two nulls as equal join keys, so the rows fail to match and two records appear in the merged dataset. The expected result is shown below.
merge2 :-
-------------------------------------------------------------------
|ID |Ht |Name|DOB |Age |DOB_1 |Name_1|
-------------------------------------------------------------------
|101|155 |aaa |2000-01-02 00:00:00|31 |2000-01-02 00:00:00|aaa |
|102|null|bbb |2000-01-03 00:00:00|32 |2000-01-03 00:00:00|bbb |
|103|170 |ccc |null |33 |null |ccc |
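We suspect this is standard SQL comparison semantics, where NULL = NULL does not evaluate to true, so an equi-join never matches two null keys. A minimal pure-Python sketch of the two comparison rules (the function names sql_eq and null_safe_eq are illustrative helpers, not PySpark APIs):

```python
def sql_eq(a, b):
    """Ordinary SQL equality: any comparison involving NULL yields NULL,
    which a join condition treats as 'no match'."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """Null-safe equality (Spark SQL's <=> operator / Column.eqNullSafe):
    two NULLs compare as equal."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

# The 'Ht' key for ID 102 is null on both sides:
print(sql_eq(None, None))        # None -> the rows do not join
print(null_safe_eq(None, None))  # True -> the rows would join
```

If this diagnosis is correct, we suspect a fix would be to join on explicit conditions using Column.eqNullSafe, e.g. df1.join(df_2, [df1['ID'] == df_2['ID'], df1['Ht'].eqNullSafe(df_2['Ht'])], 'full') and then drop the duplicated key columns from one side, but we have not verified this.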
We would appreciate any help in solving this issue.