
Suppose I have a PySpark data frame:

col1  col2  col3
1     2     -3
2     null  5
4     4     8
1     0     9

I want to add a column called check that counts, per row, the number of values greater than 0.

The final output will be:

col1  col2  col3  check
1     2     -3    2
2     null  5     2
4     4     8     3
1     0     9     2

I was trying this, but it errors out as below:

df = df.withColumn("check", sum((df[col] > 0) for col in df.columns))

Invalid argument, not a string or column: <generator object at 0x7f0a866ae580> of type <class 'generator'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
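
This error usually means that sum is resolving to pyspark.sql.functions.sum (for example after from pyspark.sql.functions import *) rather than Python's builtin sum, and the Spark aggregate function cannot accept a generator. A minimal sketch of the intended approach with the builtin sum, assuming a null comparison should simply count as 0:

from pyspark.sql import functions as F

# Builtin sum starts at 0 and adds the Column expressions together;
# when/otherwise maps both False and null comparisons to 0.
df = df.withColumn(
    "check",
    sum(F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns)
)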

– Messy

3 Answers


Don't know if there is a simpler SQL-based solution or not, but it's pretty straightforward with a udf.

from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

count_udf = udf(lambda arr: sum([1 for a in arr if a > 0]), IntegerType())
df.withColumn('check', count_udf(array('col1', 'col2', 'col3'))).show()

Not sure if it'll handle nulls. Add a null check (if a and a > 0) in the udf if needed.
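
For reference, a minimal sketch with that null check folded into the same udf:

# None is falsy, so `a and a > 0` skips nulls without raising a TypeError
count_udf = udf(lambda arr: sum([1 for a in arr if a and a > 0]), IntegerType())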

Idea: https://stackoverflow.com/a/42540401/496289


Your code looks like it sums the positive values rather than counting them. If you need the sum instead:

sum_udf = udf(lambda arr: sum([a for a in arr if a > 0]), IntegerType())
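
The same null caveat applies here: in Python, None > 0 raises a TypeError, so guard with a and a > 0 if the columns can contain nulls.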
– Kashyap
  • This didn't work. It throws an error when trying to display: PythonException: TypeError: '>' not supported between instances of 'NoneType' and 'int'. – Messy Feb 01 '22 at 23:09
  • @Messy, *"Not sure if it'll handle nulls. Add null check (if a and a > 0) in udf if needed."* – Kashyap Feb 02 '22 at 22:07

Create an array column, filter out the non-positive elements with a higher-order function, and finally count the remaining elements with size.

Example:

df.show(10,False)
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|1   |2   |-3  |
#|2   |null|5   |
#+----+----+----+

df.withColumn("check",expr("size(filter(array(col1,col2), x -> x > 0))")).show(10,False)
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|1   |2   |-3  |2    |
#|2   |null|5   |1    |
#+----+----+----+-----+
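
Note that filter as a SQL higher-order function requires Spark 2.4 or later. To avoid hard-coding the column names, the array can also be built from df.columns; a minimal sketch assuming all columns are numeric:

from pyspark.sql.functions import expr

# Build the SQL expression from the data frame's own column list
cols = ",".join(df.columns)
df.withColumn("check", expr(f"size(filter(array({cols}), x -> x > 0))")).show(10,False)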
– notNull

You can use functools.reduce to add up, over df.columns, a 1/0 flag for whether each column value is > 0:

from pyspark.sql import functions as F
from operator import add
from functools import reduce


df = spark.createDataFrame([
    (1, 2, -3), (2, None, 5), (4, 4, 8), (1, 0, 9)
], ["col1", "col2", "col3"])

df = df.withColumn(
    "check",
    reduce(add, [F.when(F.col(c) > 0, 1).otherwise(0) for c in df.columns])
)

df.show()
#+----+----+----+-----+
#|col1|col2|col3|check|
#+----+----+----+-----+
#|   1|   2|  -3|    2|
#|   2|null|   5|    2|
#|   4|   4|   8|    3|
#|   1|   0|   9|    2|
#+----+----+----+-----+
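
Note that F.when(F.col(c) > 0, 1).otherwise(0) also handles nulls for free: null > 0 evaluates to null, which is not true, so the expression falls through to otherwise(0). That is why the null in col2 counts as 0 instead of breaking the total. Using reduce with operator.add also sidesteps the original error, since the ambiguous sum name is never involved.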
– blackbishop