I want to make a user-defined aggregate function (UDAF) in PySpark. I found some documentation for Scala and would like to achieve something similar in Python.
To be more specific, assume I already have a function like this implemented:
import pyspark.sql

def process_data(df: pyspark.sql.DataFrame) -> bytes:
    ...  # do something very complicated here
and now I would like to be able to do something like:
source_df.groupBy("Foo_ID").agg(UDAF(process_data))
Now the question is: what should I put in place of UDAF?
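
For context, the closest thing I have found so far is a grouped-aggregate pandas_udf (Spark 3.0+ type-hint style), sketched below. The column name "value" and the sum() are placeholders I made up for illustration; my real process_data operates on a whole per-group DataFrame rather than a single column, which is why I am unsure this is the right tool:

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Rough sketch, assuming Spark 3.0+: a grouped-aggregate pandas UDF
# receives each group's column as a pandas Series and returns one
# scalar per group. "value" is a hypothetical column name and the
# sum() is only a placeholder aggregation.
@pandas_udf("double")
def my_agg(value: pd.Series) -> float:
    return float(value.sum())

# hypothetical usage:
# result = source_df.groupBy("Foo_ID").agg(my_agg("value"))

Is something like this what belongs in place of UDAF, or is there a mechanism that hands the aggregating function all of a group's rows at once, the way my process_data expects?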