I am trying to use a number of predefined SQL aggregate functions, along with my own UDF acting as an aggregate, on a Spark DataFrame in PySpark:
from collections import Counter

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def mode(v):
    # Return up to the 5 most common values in the collected list
    return [str(w[0]) for w in Counter(v).most_common(5)]
funs = [mean, max, min, stddev, approxCountDistinct, mode]
columns = df.columns
expr = [f(col(c)) for f in funs for c in columns]
s = df.agg(*expr).collect()
When I try to use my UDF alongside the other functions, I get:
org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty. Wrap '(avg(CAST(DBN AS DOUBLE)) AS avg(DBN))' in windowing function(s) or wrap 'DBN' in first() (or first_value) if you don't care which value you get.;;
But when I run:
funs = [mode]
columns = df.columns
expr = [f(collect_list(col(c))) for f in funs for c in columns]
s = df.agg(*expr).collect()
This gives the correct result, but only for my UDF: I can't add the built-in functions to the same expression list, because they would then receive collected lists instead of columns.
Is there a way to fold the collect_list call into my UDF so that I can run it alongside the other aggregate functions?