
I am looking into implementing a UserDefinedAggregateFunction in Spark and see that a bufferSchema is needed. I understand how to create it, but my question is: why does it require a bufferSchema at all? Shouldn't it only need a size (the number of elements used in the aggregation), an inputSchema and a dataType? Doesn't a bufferSchema constrain the intermediate steps in SQL to UserDefinedTypes?

Ghastone

1 Answer


This is needed because the buffer schema can differ from the input type. For example, if you want to calculate the average (arithmetic mean) of doubles, the buffer needs both a count and a sum. See, e.g., the Databricks example of calculating the geometric mean: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
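A minimal sketch of such an average UDAF (the class and field names here are just illustrative) makes the mismatch explicit: inputSchema describes a single Double column, while bufferSchema needs two fields for the running sum and count.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MyAverage extends UserDefinedAggregateFunction {
  // One Double column comes in ...
  override def inputSchema: StructType =
    StructType(StructField("value", DoubleType) :: Nil)

  // ... but the intermediate state needs two fields: a running sum and a count.
  override def bufferSchema: StructType = StructType(
    StructField("sum", DoubleType) :: StructField("count", LongType) :: Nil)

  // The final result is again a single Double.
  override def dataType: DataType = DoubleType

  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0.0  // sum
    buffer(1) = 0L   // count
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getDouble(0) + input.getDouble(0)
      buffer(1) = buffer.getLong(1) + 1L
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getDouble(0) + buffer2.getDouble(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }

  override def evaluate(buffer: Row): Double =
    if (buffer.getLong(1) == 0L) Double.NaN
    else buffer.getDouble(0) / buffer.getLong(1)
}
```

Because the buffer is described as a schema, Spark can serialize the partial buffers and merge them across partitions, which is why the intermediate state is expressed in SQL types rather than as an arbitrary Scala object.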

Raphael Roth
  • But why does it need to be specified as a schema? Why can't a zero element be specified as a tuple of (count, sum), as in regular RDD aggregation? Why is it constrained to a SQL schema? For instance, let's say we want to have a mutable.Seq as our aggregator: if we constrain it to a schema, we will have to recreate the array every time, as it's immutable (I believe being part of a schema wraps it in an immutable WrappedArray). – Ghastone Aug 13 '19 at 18:37