Suppose you have a DataFrame with columns of various types (string, double, ...) and a special value "miss" that represents a missing value in the string-typed columns.
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame([
    [1, 'miss'],
    [2, 'x'],
    [None, 'y']
], columns=['intcol', 'strcol'])
df = spark.createDataFrame(data=pdf)
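As far as I can tell, the None makes pandas store the first column as float, so intcol comes through as a double column in Spark (which is why the output below shows 2.0 and NaN):

df.dtypes
# should be something like [('intcol', 'double'), ('strcol', 'string')]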
I am trying to count the number of non-missing values for each column, using filtering like this:
col = df['strcol']
df.filter(col.isNotNull() & (col != 'miss')).show()
Which works for the string column:
+------+------+
|intcol|strcol|
+------+------+
|   2.0|     x|
|   NaN|     y|
+------+------+
However, for the numeric column, it filters out all the rows:
col = df['intcol']
df.filter(col.isNotNull() & (col != 'miss')).show()
+------+------+
|intcol|strcol|
+------+------+
+------+------+
It seems like this is because the cross-type comparison of the numeric column with a string value results in all-null values:
df.select(df['intcol'] != 'miss').show()
+---------------------+
|(NOT (intcol = miss))|
+---------------------+
|                 null|
|                 null|
|                 null|
+---------------------+
Which I find a bit unexpected (e.g. 1 != '' is True, not null, in "normal" Python).
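My guess is that Spark casts the string literal to double for the comparison, and the failed cast becomes null rather than an error (at least with default settings), which would explain the nulls above. A quick way to check that, using the same session as above:

from pyspark.sql import functions as F

# 'miss' cannot be parsed as a double, so the cast should come back as null
df.select(F.lit('miss').cast('double').alias('miss_as_double')).show()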
My question is really several questions:
- Why does the cross-type comparison result in nulls?
- What is the best way to test for equality/non-equality across different types in the "expected way"? Or (in my case) do I need to include separate logic that switches based on the type of the column (a rough sketch of what I mean is below the list)?
- It seems like df.filter(~df['intcol'].isin(['miss'])) does the job, but I wonder if that is less efficient?
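For reference, this is the kind of type-switching logic I have in mind for the per-column counts (just a rough sketch; it treats NaN as missing for the numeric column, since the pandas conversion turned the None into NaN, and the isnan call assumes that column is float/double):

from pyspark.sql import functions as F

agg_exprs = []
for name, dtype in df.dtypes:
    c = F.col(name)
    if dtype == 'string':
        # 'miss' only counts as missing for string columns
        not_missing = c.isNotNull() & (c != 'miss')
    else:
        # intcol is double here, so NaN (from the pandas None) marks a missing value
        not_missing = c.isNotNull() & ~F.isnan(c)
    agg_exprs.append(F.sum(not_missing.cast('int')).alias(name))

df.agg(*agg_exprs).show()

which, if I'm not mistaken, should give 2 for each column. Is there a cleaner way than this?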