7

I would like to calculate the mean value of each column without specifying all the column names.

So for example instead of doing:

res = df.select([mean('col1'), mean('col2')])

I would like to do something equivalent to:

res = df.select([mean('*')])

Is that possible?

roschach
  • 8,390
  • 14
  • 74
  • 124
  • Possible duplicate of [Apply a transformation to multiple columns pyspark dataframe](https://stackoverflow.com/questions/48452076/apply-a-transformation-to-multiple-columns-pyspark-dataframe) – pault Dec 19 '18 at 16:28
  • 8
    `df.select(*[mean(c).alias(c) for c in df.columns])` – pault Dec 19 '18 at 16:29
  • Thanks @pault. I am not sure it is a complete duplicate but maybe you can see better than me :) – roschach Dec 19 '18 at 16:33
  • the general concept of apply a function to every column using a list comprehension is duplicated. I'll see if I can find time to update the linked Q&A to be more generic. – pault Dec 19 '18 at 16:35
  • Maybe a better duplicate: https://stackoverflow.com/questions/33882894/sparksql-apply-aggregate-functions-to-a-list-of-column – pault Dec 19 '18 at 16:38
  • Pault - why do we have to add * before []? – cph_sto Dec 19 '18 at 17:44
  • 1
    @cph_sto that is to unpack the list to pass in each column expression as an argument. – pault Dec 19 '18 at 18:39
  • @pault completely understood. Thanks – cph_sto Dec 19 '18 at 19:39
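To illustrate the unpacking @pault describes, here is a minimal plain-Python sketch (the `select` function below is a stand-in for `DataFrame.select`, which accepts `*cols`, not the real PySpark method):

```python
def select(*exprs):
    # Stand-in for DataFrame.select, whose signature is select(*cols):
    # each column expression arrives as a separate positional argument.
    return list(exprs)

cols = ["col1", "col2", "col3"]
exprs = [f"mean({c})" for c in cols]  # stand-in column expressions

# Without the *, select would receive one argument: the whole list.
# With the *, the list is unpacked into select(expr1, expr2, expr3).
print(select(*exprs))
# ['mean(col1)', 'mean(col2)', 'mean(col3)']
```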

2 Answers

6

A similar solution, but maybe a little easier to read:

from pyspark.sql.functions import avg

results = df.agg(*(avg(c).alias(c) for c in df.columns))

To retrieve the results as a dictionary, use:

results.first().asDict()

This comes in handy when filling NaN values, e.g.:

df.na.fill(results.first().asDict())

No promotion, just gratitude from my side: I picked up that cool trick in a wonderful PySpark class by Layla AI (PySpark Essentials for Data Scientists).

Aku
  • 660
  • 6
  • 9
5

You can do it with:

from pyspark.sql import functions as f

res = df.select(*[f.mean(c).alias(c) for c in df.columns])
Rahul Kumar
  • 2,184
  • 3
  • 24
  • 46