7

I would like to calculate the mean value of each column without specifying all the column names.

So for example instead of doing:

res = df.select([mean('col1'), mean('col2')])

I would like to do something equivalent to:

res = df.select([mean('*')])

Is that possible?

roschach
  • 8,390
  • 14
  • 74
  • 124
  • Possible duplicate of [Apply a transformation to multiple columns pyspark dataframe](https://stackoverflow.com/questions/48452076/apply-a-transformation-to-multiple-columns-pyspark-dataframe) – pault Dec 19 '18 at 16:28
  • 8
    `df.select(*[mean(c).alias(c) for c in df.columns])` – pault Dec 19 '18 at 16:29
  • Thanks @pault. I am not sure it is a complete duplicate but maybe you can see better than me :) – roschach Dec 19 '18 at 16:33
  • the general concept of apply a function to every column using a list comprehension is duplicated. I'll see if I can find time to update the linked Q&A to be more generic. – pault Dec 19 '18 at 16:35
  • Maybe a better duplicate: https://stackoverflow.com/questions/33882894/sparksql-apply-aggregate-functions-to-a-list-of-column – pault Dec 19 '18 at 16:38
  • Pault - why do we have to add * before []? – cph_sto Dec 19 '18 at 17:44
  • 1
    @cph_sto that is to unpack the list to pass in each column expression as an argument. – pault Dec 19 '18 at 18:39
  • @pault completely understood. Thanks – cph_sto Dec 19 '18 at 19:39
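To illustrate the unpacking @pault describes, here is a minimal plain-Python sketch (the `select` function below is a stand-in for `DataFrame.select`, which accepts `*cols`, not the real PySpark method):

```python
def select(*exprs):
    # Stand-in for DataFrame.select, whose signature is select(*cols):
    # each column expression arrives as a separate positional argument.
    return list(exprs)

cols = ["col1", "col2", "col3"]
exprs = [f"mean({c})" for c in cols]  # stand-in column expressions

# Without the *, select would receive one argument: the whole list.
# With the *, the list is unpacked into select(expr1, expr2, expr3).
print(select(*exprs))
# ['mean(col1)', 'mean(col2)', 'mean(col3)']
```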

2 Answers

6

A similar solution, but maybe a little easier to read:

from pyspark.sql.functions import avg

results = df.agg(*(avg(c).alias(c) for c in df.columns))

To retrieve the results as a dictionary, use:

results.first().asDict()

This comes in handy when filling NaN values, e.g.:

df.na.fill(results.first().asDict())

No promotion, just gratitude from my side: I picked up that cool trick in a wonderful PySpark class by Layla AI (PySpark Essentials for Data Scientists).

Aku
  • 660
  • 6
  • 9
5

You can do it with:

from pyspark.sql import functions as f

res = df.select(*[f.mean(c).alias(c) for c in df.columns])
Rahul Kumar
  • 2,184
  • 3
  • 24
  • 46