From PySpark, I am trying to define a custom aggregator that accumulates state. Is this possible in Spark 2.3?
AFAIK, it has been possible to define a custom UDAF in PySpark since Spark 2.3 (cf. How to define and use a User-Defined Aggregate Function in Spark SQL?), by calling pandas_udf with the PandasUDFType.GROUPED_AGG keyword. However, since it only takes a function as a parameter, I don't think it is possible to carry state around during the aggregation.
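For concreteness, here is a minimal sketch of what I mean (the DataFrame df and its columns id and v are hypothetical):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
    # v is a pandas Series holding the whole group's values; the UDF
    # must return a single scalar -- there is no initialize/update/merge
    # buffer in which to accumulate state row by row
    return v.mean()

df.groupby("id").agg(mean_udf(df["v"])).show()
```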
From Scala, I see it is possible to have stateful aggregation by extending either UserDefinedAggregateFunction or org.apache.spark.sql.expressions.Aggregator, but is there a similar thing I can do on the Python side only?
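For comparison, on the RDD side I can already thread explicit state through an aggregation with RDD.aggregate. Here is a hypothetical example (assuming a SparkSession named spark) computing a mean via a (sum, count) accumulator, analogous to the initialize/update/merge pattern of the Scala APIs above; what I am after is an equivalent in the DataFrame API:

```python
rdd = spark.sparkContext.parallelize([1.0, 2.0, 3.0, 4.0])

zero = (0.0, 0)                                    # initial state
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)   # fold one value into the state
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge two partial states

total, count = rdd.aggregate(zero, seq_op, comb_op)
print(total / count)  # 2.5
```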