I am trying to use a number of predefined SQL aggregate functions, along with my own UDF acting as an aggregate, on a Spark DataFrame in PySpark:
from collections import Counter

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def mode(v):
    # Return up to the 5 most common values in the collected list
    return [str(w[0]) for w in Counter(v).most_common(5)]
funs = [mean, max, min, stddev, approxCountDistinct, mode]
columns = df.columns
expr = [f(col(c)) for f in funs for c in columns]
s = df.agg(*expr).collect()
When I try to use my UDF alongside the other functions, I get:
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty. Wrap '(avg(CAST(DBN AS DOUBLE)) AS avg(DBN))' in windowing function(s) or wrap 'DBN' in first() (or first_value) if you don't care which value you get.;;
But when I run:
funs = [mode]
columns = df.columns
expr = [f(collect_list(col(c))) for f in funs for c in columns]
s = df.agg(*expr).collect()
This gives the correct result, but only for my UDF: I can't add the built-in functions to the same expression list, because they would then receive collected lists instead of columns.
Is there a way to fold the collect_list call into my UDF so that I can run it alongside the other aggregate functions?