
I have a PySpark dataframe that I am grouping by one column, and then I would like to apply several different aggregation functions, including some custom ones, to different columns. So basically what I'd like to do is this (I know the syntax is all wrong; it's just an illustration):

fraction  = UserDefinedFunction(lambda x: sum(x)*100/count(col4),DoubleType())
exprs = {x: "sum" for x in [col1,col2,col3]; x: "avg" for x in [col1,col3]; x: "fraction" for x in [col1,col2]}


df1 = df.groupBy(col5).agg(*exprs)

I tried different versions of this, such as agg(sum(df.col1,df.col2,df.col3), avg(df.col1,df.col3), fraction(df.col1,df.col2)), but nothing works.

I'd appreciate your help!


1 Answer


Aggregations can be defined as a list and then unpacked with the star operator when passed to agg(). As for custom functions: unfortunately, PySpark does not yet support writing user-defined aggregation functions in Python (see this answer), but in your case you can simply combine built-in functions to achieve the same effect:

from pyspark.sql.functions import sum, avg, count

# ratio of the column's sum to the count of col4, as a column expression
fraction = lambda col: sum(col) / count('col4')

aggs = [sum(x) for x in ['col1', 'col2', 'col3']] \
    + [avg(x) for x in ['col1', 'col3']] \
    + [fraction(x) for x in ['col1', 'col2']]

df.groupBy('col5').agg(*aggs).show()
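
For completeness, here is a minimal self-contained sketch of the same idea on a toy dataframe; the column names col1–col5 and the sample rows are invented for illustration, alias() is only there to give the output columns readable names, and the *100 mirrors the percentage from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, count

spark = SparkSession.builder.getOrCreate()

# toy data with the column layout assumed in the question
df = spark.createDataFrame(
    [(1, 2, 3, 4, 'a'), (5, 6, 7, 8, 'a'), (9, 10, 11, 12, 'b')],
    ['col1', 'col2', 'col3', 'col4', 'col5'])

# percentage of the column's sum relative to the number of col4 values per group
fraction = lambda c: (sum(c) * 100 / count('col4')).alias(c + '_pct')

aggs = [sum(x).alias(x + '_sum') for x in ['col1', 'col2', 'col3']] \
    + [avg(x).alias(x + '_avg') for x in ['col1', 'col3']] \
    + [fraction(x) for x in ['col1', 'col2']]

df.groupBy('col5').agg(*aggs).show()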