I'm writing a function that outputs a dataframe containing the differences between two dataframes. Simplified, it looks like this:
from pyspark.sql import functions as F

# full outer join on id, keep both name columns, keep only rows where the names differ
differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
    .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
    .where(df1['name'] != df2['name'])
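The two input frames are built roughly like this (a minimal sketch to reproduce, assuming an active SparkSession named spark):

df1 = spark.createDataFrame(
    [(1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dan'), (5, 'Eve')],
    ['id', 'name'])
df2 = spark.createDataFrame(
    [(1, 'Alice'), (2, 'Ben'), (4, 'Dan'), (5, 'Eve'), (6, 'Finn')],
    ['id', 'name'])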
Given the following two input dataframes, I expect the third table as the output:
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
| 3|Carol|
| 4| Dan|
| 5| Eve|
+---+-----+
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Ben|
| 4| Dan|
| 5| Eve|
| 6| Finn|
+---+-----+
+---+-----+-----+
| id| name| name|
+---+-----+-----+
|  2|  Bob|  Ben|
|  3|Carol| null|
|  6| null| Finn|
+---+-----+-----+
But when I run it in Databricks, the rows containing nulls are omitted from the result dataframe:
+---+-----+-----+
| id| name| name|
+---+-----+-----+
|  2|  Bob|  Ben|
+---+-----+-----+
Are the null values not considered != by the where clause? Or is there some hidden logic when these frames are created?
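To make the question concrete, here is a sketch that selects the comparison result instead of filtering on it (same frames and session as above); if I understand Spark's null semantics correctly, the comparison evaluates to null rather than true whenever one of the names is null:

df1.join(df2, df1['id'] == df2['id'], how='full') \
    .select(F.coalesce(df1['id'], df2['id']).alias('id'),
            (df1['name'] != df2['name']).alias('names_differ')) \
    .show()
# for ids 3 and 6, names_differ comes out as null rather than true,
# which would explain why the where clause drops those rows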