Issue when joining two DataFrames that have null values in the join key in PySpark.
We have two PySpark DataFrames.
df1 :-
-----------------------------------
|ID |Name|DOB |Ht |
-----------------------------------
|101|aaa |2000-01-02 00:00:00|155 |
|102|bbb |2000-01-03 00:00:00|null|
|103|ccc |null |170 |
df2 :-
---------------------------------------
|ID |Age|DOB |Name|Ht |
---------------------------------------
|101|31 |2000-01-02 00:00:00|aaa |155 |
|102|32 |2000-01-03 00:00:00|bbb |null|
|103|33 |null |ccc |170 |
Case 1: We need to full join df1 and df2 on the keys 'ID' and 'Ht'. The code we used is shown below.
The columns that have the same name in both DataFrames are renamed before joining.
df_2 = df2
rename_vars = ['Name', 'DOB']
rename_vars1 = ['Name_1', 'DOB_1']
# Rename the overlapping non-key columns in df2 before joining
for colname, newcolname in zip(rename_vars, rename_vars1):
    df_2 = df_2.withColumnRenamed(colname, newcolname)
merge2 = df1.join(df_2, on=['ID', 'Ht'], how='full')
merge2.printSchema()
merge2.show()
The output is as shown below.
merge2 :-
-------------------------------------------------------------------
|ID |Ht |Name|DOB |Age |DOB_1 |Name_1|
-------------------------------------------------------------------
|101|155 |aaa |2000-01-02 00:00:00|31 |2000-01-02 00:00:00|aaa |
|102|null|bbb |2000-01-03 00:00:00|null|null |null |
|102|null|null|null |32 |2000-01-03 00:00:00|bbb |
|103|170 |ccc |null |33 |null |ccc |
We expect only one record for ID = '102'. Here, since the value of 'Ht' is null, PySpark does not treat the two nulls as equal join keys, so the rows fail to match and two records appear in the merged dataset. The expected result is shown below.
merge2 :-
-------------------------------------------------------------------
|ID |Ht |Name|DOB |Age |DOB_1 |Name_1|
-------------------------------------------------------------------
|101|155 |aaa |2000-01-02 00:00:00|31 |2000-01-02 00:00:00|aaa |
|102|null|bbb |2000-01-03 00:00:00|32 |2000-01-03 00:00:00|bbb |
|103|170 |ccc |null |33 |null |ccc |
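We suspect this is standard SQL comparison semantics, where NULL = NULL does not evaluate to true, so an equi-join never matches two null keys. A minimal pure-Python sketch of the two comparison rules (the function names sql_eq and null_safe_eq are illustrative helpers, not PySpark APIs):

```python
def sql_eq(a, b):
    """Ordinary SQL equality: any comparison involving NULL yields NULL,
    which a join condition treats as 'no match'."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """Null-safe equality (Spark SQL's <=> operator / Column.eqNullSafe):
    two NULLs compare as equal."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

# The 'Ht' key for ID 102 is null on both sides:
print(sql_eq(None, None))        # None -> the rows do not join
print(null_safe_eq(None, None))  # True -> the rows would join
```

If this diagnosis is correct, we suspect a fix would be to join on explicit conditions using Column.eqNullSafe, e.g. df1.join(df_2, [df1['ID'] == df_2['ID'], df1['Ht'].eqNullSafe(df_2['Ht'])], 'full') and then drop the duplicated key columns from one side, but we have not verified this.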
We would appreciate any help in solving this issue.