
What is the difference between:

F.when(F.col('Name').isNull())

and:

F.when(F.col('Name') == None)

They don't appear to behave the same. The first one works reliably when checking for null values in a column, while the second sometimes does not. For example, if I want to find the null values and replace the null Names with "Missing name", the second method sometimes does nothing. I'm not sure why it works in some cases and not in others, but the first method always works.
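To make it concrete, here is a minimal sketch of the replacement I mean (the column name and label are just examples):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([('Alice',), (None,)], ['Name'])

# isNull(): the null row becomes "Missing name", as intended
df.withColumn('Name', F.when(F.col('Name').isNull(), 'Missing name')
                       .otherwise(F.col('Name'))).show()

# == None: the condition compiles to (Name = NULL), which is never True,
# so the null row is left untouched
df.withColumn('Name', F.when(F.col('Name') == None, 'Missing name')
                       .otherwise(F.col('Name'))).show()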

Nicholas Tree

3 Answers

F.when(F.col('Name') == None)

None is an object in Python, of type NoneType. It's not the same as an empty string or 0. For this case, you would want to use

F.when(F.col('Name') is None)

Using ==, you're checking whether the value of F.col('Name') equals the None object, which is going to throw things off.
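For illustration, a small sketch (assuming an active Spark session) of what the == expression actually builds:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# == on a Column does not compare against the Python None object at all;
# it builds a SQL comparison expression that never evaluates to True
expr = F.col('Name') == None
print(expr)  # e.g. Column<'(Name = NULL)'> (exact repr varies by Spark version)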

Harben
  • The OP asks about “isNull” vs “== None”. While “is None” *is* preferred, such is secondary and should have no relevant impact .. maybe a miscopy of the original question? – user2864740 Jul 18 '20 at 19:29
  • I tried to explain OP's original question with an example. Please take a look. Hope this explanation will help people understand the difference between isNull and == None – Shantanu Kher Jul 19 '20 at 23:26

Refer here: Filter Pyspark dataframe column with None value

Equality-based comparisons with NULL won't work: in SQL, NULL is undefined, so any attempt to compare it with another value returns NULL.
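A minimal sketch of this three-valued logic in plain Spark SQL (assuming an active session):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Comparing NULL with = or <> yields NULL, not True/False; only IS NULL is definite
spark.sql("SELECT NULL = NULL AS eq, NULL <> NULL AS neq, NULL IS NULL AS isn").show()
# Expected output (roughly):
# +----+----+----+
# |eq  |neq |isn |
# +----+----+----+
# |null|null|true|
# +----+----+----+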

Raghu
  • Coming from a LINQ background, feels like a historical (annoyance) restriction in the mapping to SQL .. oh well. For reference, LINQ normally converts “== null” (code) to “is NULL”, separating the rules of each layer. Is this specific to Spark, or how does the Python Database API define the operation? That is, can the answer (in Python) be generalized more consistently? – user2864740 Jul 18 '20 at 21:06

In the data world, two NULL values (or, for that matter, two Nones) are not identical.

Therefore, an == or != comparison involving None never evaluates to True (in SQL it yields NULL, which a filter treats as false). That is the key reason the isNull() and isNotNull() functions exist.

Please take a look at the example below for a better understanding.

First, create a dataframe with a few valid records and one record where txt is None:

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Row 2 holds an empty string, row 3 holds None -- these are different things
lst = [(1, 'sometext'), (2, ''), (3, None), (4, 'someothertext')]

myrdd = spark.sparkContext.parallelize(lst).map(lambda x: Row(id=x[0], txt=x[1]))
mydf = spark.createDataFrame(myrdd)

isNull() returns True for row #3, so the statement below returns exactly one row:

mydf.filter(col("txt").isNull()).show(truncate=False)
+---+----+
|id |txt |
+---+----+
|3  |null|
+---+----+

The == comparison, on the other hand, never evaluates to True (for row #3 it yields NULL), so no records are returned:

mydf.filter(col("txt") == None).show(truncate=False)
+---+---+
|id |txt|
+---+---+
+---+---+
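For completeness, the complementary isNotNull() filter keeps the other three rows; note that the empty string in row #2 is not NULL:

# An empty string is a value, not NULL, so row #2 survives this filter
mydf.filter(col("txt").isNotNull()).show(truncate=False)
# +---+-------------+
# |id |txt          |
# +---+-------------+
# |1  |sometext     |
# |2  |             |
# |4  |someothertext|
# +---+-------------+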
Shantanu Kher