
I have a PySpark dataframe that I am grouping by one column, and then I would like to apply several different aggregation functions, including some custom ones, to different columns. So basically what I'd like to do is this (I know the syntax is all wrong; it's just an illustration):

fraction  = UserDefinedFunction(lambda x: sum(x)*100/count(col4),DoubleType())
exprs = {x: "sum" for x in [col1,col2,col3]; x: "avg" for x in [col1,col3]; x: "fraction" for x in [col1,col2]}


df1 = df.groupBy(col5).agg(*exprs)

I tried different versions of this, such as agg(sum(df.col1,df.col2,df.col3), avg(df.col1,df.col3), fraction(df.col1,df.col2)), but nothing works.

I'd appreciate your help!


1 Answer


Aggregations can be defined as a list and then unpacked with the star operator when passed to agg(). As for custom functions: unfortunately, PySpark does not yet support writing user-defined aggregation functions in Python (see this answer), but in your case you can simply combine built-in functions to achieve the same effect:

from pyspark.sql.functions import sum, avg, count

# ratio of the column's sum to the count of col4, as a column expression
fraction = lambda col: sum(col) / count('col4')

aggs = [sum(x) for x in ['col1', 'col2', 'col3']] \
    + [avg(x) for x in ['col1', 'col3']] \
    + [fraction(x) for x in ['col1', 'col2']]

df.groupBy('col5').agg(*aggs).show()
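
For completeness, here is a minimal self-contained sketch of the same idea on a toy dataframe; the column names col1–col5 and the sample rows are invented for illustration, alias() is only there to give the output columns readable names, and the *100 mirrors the percentage from the question:

from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, avg, count

spark = SparkSession.builder.getOrCreate()

# toy data with the column layout assumed in the question
df = spark.createDataFrame(
    [(1, 2, 3, 4, 'a'), (5, 6, 7, 8, 'a'), (9, 10, 11, 12, 'b')],
    ['col1', 'col2', 'col3', 'col4', 'col5'])

# percentage of the column's sum relative to the number of col4 values per group
fraction = lambda c: (sum(c) * 100 / count('col4')).alias(c + '_pct')

aggs = [sum(x).alias(x + '_sum') for x in ['col1', 'col2', 'col3']] \
    + [avg(x).alias(x + '_avg') for x in ['col1', 'col3']] \
    + [fraction(x) for x in ['col1', 'col2']]

df.groupBy('col5').agg(*aggs).show()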