This is probably a duplicate, but somehow I have already been searching for a long time:

I want to get the number of nulls per row in a Spark DataFrame. E.g., given:

col1 col2 col3
null    1    a
   1    2    b
   2    3 null

The result should be:

col1 col2 col3 number_of_null
null    1    a              1
   1    2    b              0
   2    3 null              1

More generally, I want to count how many times a certain string or number appears in a row of a Spark DataFrame.

For example:

col1 col2 col3  number_of_ABC
 ABC    1    a              1
   1    2    b              0
   2  ABC  ABC              2

I am using PySpark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I can't seem to google this. :/

EDIT: Assume that I have so many columns that I can't list them all.

EDIT2: I explicitly don't want a pandas solution.

EDIT3: The solution suggested with sums or means does not work, as it throws errors:

(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
guscht
  • Possible duplicate of [Spark DataFrame: Computing row-wise mean (or any aggregate operation)](https://stackoverflow.com/questions/32670958/spark-dataframe-computing-row-wise-mean-or-any-aggregate-operation) – pault Sep 21 '18 at 14:30
  • See the linked dupe: `df.select(sum(col(x).isNull() for x in df.columns).alias("number_of_null"))` – pault Sep 21 '18 at 14:31
  • When doing exactly that on my dataset I receive: `py4j.protocol.Py4JJavaError: An error occurred while calling o1999.select. : org.apache.spark.sql.AnalysisException: cannot resolve '((`log_time` IS NULL) + 0)' due to data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int).;;` – guscht Sep 21 '18 at 14:46
  • Cast the boolean to int: `df.select(sum((col(x).isNull()).cast("int") for x in df.columns).alias("number_of_null"))` – pault Sep 21 '18 at 14:53
  • That seems to work. Thanks!! – guscht Sep 21 '18 at 14:57
  • Please accept (and consider voting on) the duplicate target if you can. – pault Sep 21 '18 at 14:59
  • In Python 2 I get the error 'column is not callable'. In Python 3 I get 'column is not iterable'. I have the same issue with the linked duplicate solution. – Michael West Sep 21 '18 at 16:18
  • @MichaelWest `sum` here refers to `__builtin__.sum` not `pyspark.sql.functions.sum`. This won't work if you did `from pyspark.sql.functions import *` (which would be [bad practice](https://stackoverflow.com/questions/2386714/why-is-import-bad) anyway). – pault Sep 21 '18 at 16:23
  • Perfect. I actually need `pyspark.sql.functions.sum`, so I specify `__builtin__.sum` explicitly. – Michael West Sep 21 '18 at 16:35
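
Putting the fix from the comments together, a minimal runnable sketch (this assumes an existing SparkSession named `spark`; `sum` here is the Python builtin, not `pyspark.sql.functions.sum`):

from pyspark.sql.functions import col

df = spark.createDataFrame(
    [(None, 1, "a"), (1, 2, "b"), (2, 3, None)],
    ["col1", "col2", "col3"],
)

# The builtin sum adds up the 0/1 indicator columns; the cast to int
# avoids the "differing types ... (boolean and int)" AnalysisException
number_of_null = sum(col(c).isNull().cast("int") for c in df.columns)
df.withColumn("number_of_null", number_of_null).show()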

2 Answers

In Scala:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
import spark.implicits._ // for toDF

val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")

val expected = "ABC"

// Map every column to a 0/1 indicator, then sum the indicators per row
val complexColumn: Column = df.schema.fieldNames
  .map(c => when(col(c) === lit(expected), 1).otherwise(0))
  .reduce((a, b) => a + b)

df.withColumn("countABC", complexColumn).show(false)

Output:

+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1   |a   |1       |
|1   |2   |b   |0       |
|2   |ABC |ABC |2       |
+----+----+----+--------+
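
For reference, a PySpark port of this map/reduce pattern might look like the following (a sketch using the question's column names; not part of the original answer):

from functools import reduce
from pyspark.sql.functions import col, lit, when

expected = "ABC"

# One 0/1 indicator per column, summed per row
count_abc = reduce(
    lambda a, b: a + b,
    [when(col(c) == lit(expected), 1).otherwise(0) for c in df.columns],
)
df.withColumn("number_of_ABC", count_abc).show()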
pasha701

As in pasha701's answer, I resort to map and reduce. Note that I am working with Spark 1.6.x and Python 2.7.

Taking your DataFrame as `df` (as is):

import pyspark.sql.functions as func  # sqlc below is an existing SQLContext

dfvals = [
  (None, "1", "a"),
  ("1", "2", "b"),
  ("2", None, None)
]

df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])

# Map each column to a 0/1 null indicator, then reduce by addition
# (reduce is a builtin in Python 2)
new_df = df.withColumn('null_cnt', reduce(lambda x, y: x + y,
                                          map(lambda x: func.when(func.isnull(func.col(x)), 1).otherwise(0),
                                              df.schema.names)))

Check whether each value is null and assign 1 or 0, then add the results to get the count.

new_df.show()

+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null|   1|   a|       1|
|   1|   2|   b|       0|
|   2|null|null|       2|
+----+----+----+--------+
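
On Python 3 and Spark 2.x, the same approach works once `reduce` is imported from `functools` (a hedged sketch, not from the original answer):

from functools import reduce
import pyspark.sql.functions as func

null_cnt = reduce(
    lambda x, y: x + y,
    [func.when(func.col(c).isNull(), 1).otherwise(0) for c in df.schema.names],
)
df.withColumn('null_cnt', null_cnt).show()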
samkart