I am looking into implementing a UserDefinedAggregateFunction in Spark and see that a bufferSchema is needed. I understand how to create one, but my question is: why does it require a bufferSchema at all? Shouldn't it only need a size (the number of elements used in the aggregation), an inputSchema, and a dataType? Doesn't a bufferSchema constrain the intermediate steps to SQL types (UserDefinedTypes at best)?
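For context, this is roughly the set of members the (pre-Spark-3) UserDefinedAggregateFunction API asks you to override; the signatures below are a simplified sketch, not the exact Spark source:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.types.{DataType, StructType}

// Simplified sketch of the abstract members (names follow the Spark 2.x API)
abstract class UserDefinedAggregateFunction {
  def inputSchema: StructType    // schema of the input column(s)
  def bufferSchema: StructType   // schema of the intermediate aggregation buffer
  def dataType: DataType         // type of the final result
  def deterministic: Boolean
  def initialize(buffer: MutableAggregationBuffer): Unit
  def update(buffer: MutableAggregationBuffer, input: Row): Unit
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit
  def evaluate(buffer: Row): Any
}
```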

Ghastone
1 Answer
This is needed because the buffer schema can differ from the input type. For example, if you want to calculate the average (arithmetic mean) of doubles, the buffer needs both a count and a sum. See, for instance, the Databricks example of calculating the geometric mean: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
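As a concrete sketch of what such a buffer looks like in the pre-Spark-3 UserDefinedAggregateFunction API, an arithmetic-mean UDAF might be written roughly like this (the class name MyAverage is made up for illustration):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Arithmetic mean of a Double column: the buffer holds (sum, count),
// which matches neither the input schema nor the final return type.
class MyAverage extends UserDefinedAggregateFunction {
  override def inputSchema: StructType =
    StructType(StructField("value", DoubleType) :: Nil)

  override def bufferSchema: StructType =
    StructType(StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)

  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0 // running sum
    buffer(1) = 0L  // running count
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  override def evaluate(buffer: Row): Any =
    if (buffer.getLong(1) == 0L) null else buffer.getDouble(0) / buffer.getLong(1)
}
```

Registered via something like spark.udf.register("myAverage", new MyAverage), it can then be used like any built-in aggregate in SQL or the DataFrame API.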

Raphael Roth
But why does it need to be specified as a schema? Why can the zero element not be specified as a tuple of (count, sum), as in regular RDD aggregation? Why is it constrained to a SQL schema? For instance, suppose we want a mutable.Seq as our aggregation buffer: if we are constrained to a schema, we have to recreate the array on every update because it is immutable (I believe the schema wraps it in an immutable WrappedArray). – Ghastone Aug 13 '19 at 18:37
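For comparison with the comment above, a hedged sketch of the RDD-style aggregation the commenter is referring to, where the zero element really is just an ordinary Scala tuple and no schema is involved:

```scala
import org.apache.spark.rdd.RDD

// Sketch only: nums is assumed to be some existing RDD[Double].
def mean(nums: RDD[Double]): Double = {
  val (sum, count) = nums.aggregate((0.0, 0L))(
    (acc, x) => (acc._1 + x, acc._2 + 1L),  // fold a value into the partition-local accumulator
    (a, b)   => (a._1 + b._1, a._2 + b._2)  // merge accumulators across partitions
  )
  if (count == 0L) Double.NaN else sum / count
}
```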