
I'm writing a function that outputs a dataframe with the differences between two dataframes. Simplified, it looks like this:

differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
  .where(df1['name'] != df2['name'])

Given the following two datasets, I expect the third to be the output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Carol|
|  4|  Dan|
|  5|  Eve|
+---+-----+

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Ben|
|  4|  Dan|
|  5|  Eve|
|  6| Finn|
+---+-----+

+---+-------+-------+
| id|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
|  3|  Carol|   null|
|  6|   null|   Finn|
+---+-------+-------+

But when I run it in databricks, the null columns are omitted from the result dataframe.

+---+-------+-------+
| id|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
+---+-------+-------+

Are they not considered != by the where clause? Is there hidden logic when these frames are created?

Alex Ott
Ben

1 Answer


From Why [table].[column] != null is not working?:

"NULL in a database is not a value. It means something like "unknown" or "data missing".

You cannot tell if something where you don't have any information about is equal to something else where you also don't have any information about (=, != operators). But you can say whether there is any information available (IS NULL, IS NOT NULL)."

So you will have to add more conditions:

differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
  .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
  .where((df1['name'] != df2['name']) | df1['name'].isNull() | df2['name'].isNull())


+---+-----+----+
|id |name |name|
+---+-----+----+
|2  |Bob  |Ben |
|3  |Carol|null|
|6  |null |Finn|
+---+-----+----+
Luiz Viola
  • There is also a null-safe equal operator in Spark, `<=>`. Using the DataFrame API: `.where(~df1["name"].eqNullSafe(df2["name"]))`. Or as a SQL expression: `F.expr("df1.name <=> df2.name")`. See [sql-ref-null-semantics](https://spark.apache.org/docs/3.0.0-preview/sql-ref-null-semantics.html#comparision-operators-) – blackbishop Nov 06 '21 at 11:23