
I want to make a user-defined aggregate function (UDAF) in PySpark. I found some documentation for Scala and would like to achieve something similar in Python.

To be more specific, assume I already have a function like this implemented:

import pyspark.sql

def process_data(df: pyspark.sql.DataFrame) -> bytes:
    ...  # do something very complicated here

and now I would like to be able to do something like:

source_df.groupBy("Foo_ID").agg(UDAF(process_data))

Now the question is - what should I put in place of UDAF?
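For context, the closest I have found so far is GroupedData.applyInPandas (Spark 3.0+). It is not a true UDAF: each group is handed to the Python function as a pandas DataFrame rather than a pyspark.sql.DataFrame, and the function has to return a pandas DataFrame matching a declared schema instead of bytes, so it only approximates what I want. A minimal sketch of that pattern (the "value" and "result" columns are made up for illustration):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
    [("a", 1), ("a", 2), ("b", 3)], ["Foo_ID", "value"])

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Each group arrives as one pandas DataFrame; emit one row per group.
    return pd.DataFrame({"Foo_ID": [pdf["Foo_ID"].iloc[0]],
                         "result": [int(pdf["value"].sum())]})

result_df = source_df.groupBy("Foo_ID").applyInPandas(
    process_group, schema="Foo_ID string, result long")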

user344577
    Does this answer your question? [Applying UDFs on GroupedData in PySpark (with functioning python example)](https://stackoverflow.com/questions/40006395/applying-udfs-on-groupeddata-in-pyspark-with-functioning-python-example) – Jonathan Lam Sep 23 '22 at 09:26