-1

Hi All I have 2 dataframes in i am comparing values of both the dataframe and based on value assigning value to one new dataframe. all the scenarios are working fine expect null fields comparision i.e. if in both the dataframe values are null then it should show as "varified" but its giving me as "not varified" I am sharing my dataframes data and code which i'm using and result of final dataframe below.

    scala> df1.show()
    +---+-----+---+--------+------+-------+
    | id| name|age|lastname|  city|country|
    +---+-----+---+--------+------+-------+
    |  1|rohan| 26|  sharma|mumbai|  india|
    |  2|rohan| 26|  sharma|  null|  india|
    |  3|rohan| 26|    null|mumbai|  india|
    |  4|rohan| 26|  sharma|mumbai|  india|
    +---+-----+---+--------+------+-------+
    scala> df2.show()
    +----+------+-----+----------+------+---------+
    |o_id|o_name|o_age|o_lastname|o_city|o_country|
    +----+------+-----+----------+------+---------+
    |   1| rohan|   26|    sharma|mumbai|    india|
    |   2| rohan|   26|    sharma|  null|    india|
    |   3| rohan|   26|    sharma|mumbai|    india|
    |   4| rohan|   26|      null|mumbai|    india|
    +----+------+-----+----------+------+---------+

    val df3 = df1.join(df2, df1("id") === df2("o_id"))
    .withColumn("result", when(df1("name") === df2("o_name") && 
    df1("age") === df2("o_age") && 
    df1("lastname") === df2("o_lastname") && 
    df1("city") === df2("o_city")  &&
    df1("country") === df2("o_country"), "Varified")
    .otherwise("Not Varified")).show()

    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    | id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
    |  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
    |  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|Not Varified|
    |  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
    |  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
    +---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+

I want that for id '2' also it should show as 'Varified'.but the city is null in both the column then its showing as 'Not Varified'. Can someone Please guide me how should i Modify my df3 query so it can check null also and for for id '2' also can show as 'Varified' in result column.

Up Ap
  • 107
  • 2
  • 12
  • Generally, `null != null` (in sql, and in spark). But there's an operator `<=>` (null-safe equality) that might help you. see https://stackoverflow.com/questions/41728762/including-null-values-in-an-apache-spark-join for details – Rayan Ral May 16 '20 at 19:32
  • Possible duplicate of [Including null values in an Apache Spark Join](https://stackoverflow.com/questions/41728762/including-null-values-in-an-apache-spark-join) – user10938362 May 17 '20 at 12:06

2 Answers2

3

use <=> instead of ===

val df3 = df1.join(df2, df1("id") === df2("o_id"))
    .withColumn("result", when(df1("name") <=> df2("o_name") && 
    df1("age") <=> df2("o_age") && 
    df1("lastname") <=> df2("o_lastname") && 
    df1("city") <=> df2("o_city")  &&
    df1("country") <=> df2("o_country"), "Varified")
    .otherwise("Not Varified")).show()
spark.sql("SELECT NULL AS city1, NULL AS city2").select($"city1" <=> $"city2").show

results

+-----------------+
|(city1 <=> city2)|
+-----------------+
|            true |
+-----------------+
Ranga Vure
  • 1,922
  • 3
  • 16
  • 23
1

In your when+otherwise statement add <=> (or) || operator and check .isNull for last_name and city column.

null=null returns null reason behind why we are not able to match.

spark.sql("select null=null").show()
//+-------------+
//|(NULL = NULL)|
//+-------------+
//|         null|
//+-------------+

Using <=>,isnull():

spark.sql("select null<=>null, isnull(null) = isnull(null)").show()
//+---------------+---------------------------------+
//|(NULL <=> NULL)|((NULL IS NULL) = (NULL IS NULL))|
//+---------------+---------------------------------+
//|           true|                             true|
//+---------------+---------------------------------+

Example:

df1.join(df2, df1("id") === df2("o_id")).
withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age") ) && 
(df1("lastname") === df2("o_lastname")|| (df1("lastname").isNull === df2("o_lastname").isNull)) && 
(df1("city") === df2("o_city")|| (df1("city").isNull === df2("o_city").isNull))  && 
(df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).
show()

//or using <>
df1.join(df2, df1("id") === df2("o_id")).withColumn("result", when( (df1("name") === df2("o_name")) && (df1("age") === df2("o_age")) && (df1("lastname") <=> df2("o_lastname")) && (df1("city") <=> df2("o_city"))  && (df1("country") === df2("o_country")), "Varified").otherwise("Not Varified")).show()

//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//| id| name|age|lastname|  city|country|o_id|o_name|o_age|o_lastname|o_city|o_country|      result|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
//|  1|rohan| 26|  sharma|mumbai|  india|   1| rohan|   26|    sharma|mumbai|    india|    Varified|
//|  2|rohan| 26|  sharma|  null|  india|   2| rohan|   26|    sharma|  null|    india|    Varified|
//|  3|rohan| 26|    null|mumbai|  india|   3| rohan|   26|    sharma|mumbai|    india|Not Varified|
//|  4|rohan| 26|  sharma|mumbai|  india|   4| rohan|   26|      null|mumbai|    india|Not Varified|
//+---+-----+---+--------+------+-------+----+------+-----+----------+------+---------+------------+
notNull
  • 30,258
  • 4
  • 35
  • 50