
I'm writing a function that outputs a dataframe with the differences between two dataframes. Simplified, it looks like this:

differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
  .where(df1['name'] != df2['name'])

Given the following two datasets, I expect the third to be the output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Carol|
|  4|  Dan|
|  5|  Eve|
+---+-----+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Ben|
|  4|  Dan|
|  5|  Eve|
|  6| Finn|
+---+-----+

+---+-------+-------+
| id|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
|  3|  Carol|   null|
|  6|   null|   Finn|
+---+-------+-------+

But when I run it in databricks, the null columns are omitted from the result dataframe.

+---+-------+-------+
| id|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
+---+-------+-------+

Are they not considered != by the where clause? Is there hidden logic when these frames are created?

Alex Ott
Ben

1 Answer


From Why [table].[column] != null is not working?:

"NULL in a database is not a value. It means something like "unknown" or "data missing".

You cannot tell if something where you don't have any information about is equal to something else where you also don't have any information about (=, != operators). But you can say whether there is any information available (IS NULL, IS NOT NULL)."

So you will have to add more conditions:

differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
  .where((df1['name'] != df2['name']) | df1['name'].isNull() | df2['name'].isNull())


+---+-----+----+
|id |name |name|
+---+-----+----+
|2  |Bob  |Ben |
|3  |Carol|null|
|6  |null |Finn|
+---+-----+----+
Luiz Viola
  • There is also a null-safe equal operator in Spark, `<=>`. Using the DataFrame API: `.where(~df1["name"].eqNullSafe(df2["name"]))`. Or as a SQL expression: `F.expr("df1.name <=> df2.name")`. See [sql-ref-null-semantics](https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html#comparision-operators-) – blackbishop Nov 06 '21 at 11:23