5

I have a very wide df with a large number of columns. I need to get the count of non-null values per row for this in python.

Example DF -

+-----+----------+-----+-----+-----+-----+-----+-----+
| name|      date|col01|col02|col03|col04|col05|col06|
+-----+----------+-----+-----+-----+-----+-----+-----+
|name1|2017-12-01|100.0|255.5|333.3| null|125.2|132.7|
|name2|2017-12-01|101.1|105.5| null| null|127.5| null|

I want to add a column with a count of non-null values in col01-col06 -

+-----+----------+-----+-----+-----+-----+-----+-----+-----+
| name|      date|col01|col02|col03|col04|col05|col06|count|
+-----+----------+-----+-----+-----+-----+-----+-----+-----+
|name1|2017-12-01|100.0|255.5|333.3| null|125.2|132.7|    5| 
|name2|2017-12-01|101.1|105.5| null| null|127.5| null|    3|

I was able to get this in a pandas df like this -

df['count']=df.loc[:,'col01':'col06'].notnull().sum(axis=1)     

But no luck with spark df so far :( Any ideas?

NITS
  • 207
  • 4
  • 15
  • 1
    Possible duplicate of [Spark DataFrame: Computing row-wise mean (or any aggregate operation)](https://stackoverflow.com/questions/32670958/spark-dataframe-computing-row-wise-mean-or-any-aggregate-operation) – pault Apr 05 '19 at 02:07

1 Answers1

7

Convert the null values to true/false, then to integers, then sum them:

from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

df = spark.createDataFrame([[1, None, None, 0], 
                            [2, 3, 4, None], 
                            [None, None, None, None], 
                            [1, 5, 7, 2]], 'a: int, b: int, c: int, d: int')

df.select(sum([F.isnull(df[col]).cast(IntegerType()) for col in df.columns]).alias('null_count')).show()

Output:

+----------+
|null_count|
+----------+
|         2|
|         1|
|         4|
|         0|
+----------+

gmds
  • 19,325
  • 4
  • 32
  • 58
  • Thank you so much gmds! This is exactly what I was looking for. I made a slight update to this to subtract this number from the total count (as I wanted the non-null count) and used withColumn to add the new column and that was it :) – NITS Apr 05 '19 at 14:52
  • @user11308481 no problem! (I didn't see you wanted the *non* null count; my bad) – gmds Apr 05 '19 at 14:59
  • @gmds I get the error that "TypeError: 'Column' object is not callable", any suggestions? – Maeror Jan 28 '21 at 06:23