
I am looking into implementing a UserDefinedAggregateFunction in Spark and see that a bufferSchema is needed. I understand how to create it, but my question is: why does it require a bufferSchema at all? Shouldn't it only need a size (the number of elements used in the aggregation), an inputSchema and a dataType? Doesn't a bufferSchema constrain the intermediate steps in SQL to UserDefinedTypes?

Ghastone

1 Answer


This is needed because the buffer schema can differ from the input type. For example, if you want to calculate the average (arithmetic mean) of doubles, the buffer needs both a count and a sum. See, e.g., the Databricks example of calculating the geometric mean: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
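A minimal sketch of such an average UDAF (the class and field names here are just illustrative) makes the mismatch explicit: inputSchema describes a single Double column, while bufferSchema needs two fields for the running sum and count.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyAverage extends UserDefinedAggregateFunction {
  // One Double column comes in ...
  override def inputSchema: StructType =
    StructType(StructField("value", DoubleType) :: Nil)

  // ... but the intermediate state needs two fields: a running sum and a count.
  override def bufferSchema: StructType = StructType(
    StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)

  // The final result is again a single Double.
  override def dataType: DataType = DoubleType

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0  // sum
    buffer(1) = 0L   // count
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  override def evaluate(buffer: Row): Double =
    if (buffer.getLong(1) == 0L) Double.NaN
    else buffer.getDouble(0) / buffer.getLong(1)
}
```

Because the buffer is described as a schema, Spark can serialize the partial buffers and merge them across partitions, which is why the intermediate state is expressed in SQL types rather than as an arbitrary Scala object.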

Raphael Roth
  • But why does it need to be specified as a schema? Why can't a zero element be specified as a tuple of (count, sum), as in regular RDD aggregation? Why is it constrained to a SQL schema? For instance, let's say we want to have a mutable.Seq as our aggregator: if we constrain it to a schema, we will have to recreate the array every time, as it's immutable (I believe being part of a schema wraps it in an immutable WrappedArray). – Ghastone Aug 13 '19 at 18:37