
I tried using .agg(avg("boolean_column")), but got the error:

"function average requires numeric types, not boolean"

How can I get the average of such a column?

BirdLaw
  • Say you have two values: true, false. What's the average? – Óscar López Jun 18 '19 at 18:20
  • The average of true and false is the average of 1 and 0, i.e. 0.5. – BirdLaw Jun 18 '19 at 18:20
  • @ÓscarLópez For example: if you have a binary prediction problem where success is denoted by a boolean, we can take the average of this as an integer to calculate the "success rate" – pault Jun 18 '19 at 18:21
  • Of course, but how do I do that in PySpark? In general, I find PySpark quite unintuitive. – BirdLaw Jun 18 '19 at 18:21
  • Not quite a duplicate, but related to [how to change a Dataframe column from String type to Double type in pyspark](https://stackoverflow.com/questions/32284620/how-to-change-a-dataframe-column-from-string-type-to-double-type-in-pyspark) – pault Jun 18 '19 at 18:28

1 Answer


Convert the column to a numeric type, then take the average:

from pyspark.sql.functions import avg, col
# Cast true/false to 1.0/0.0 so that avg() accepts the column
df.groupBy(...).agg(avg(col("boolean_column").cast("double")))
pault
  • 41,343
  • 15
  • 107
  • 149