I have a data with the following scheme:
sourceip
destinationip
packets sent
And I want to calculate several aggregative fields out of this data and have the following schema:
ip
packets sent as sourceip
packets sent as destination
In the happy days of RDDs I could use aggregate
, define a map of {ip -> []}, and count the appearances in a corresponding array location.
In the Dataset/Dataframe aggregate is no longer available, instead UDAF could be used, unfortunately, from the experience I had with UDAF they are immutable, means they cannot be used (have to create a new instance on every map update) example + explanation here
on one hand, technically, I could convert the Dataset to RDD, aggregate etc and go back to dataset. Which I expect would result in performance degradation, as Datasets are more optimized. UDAFs are out of the question due to the copying.
Is there any other way to perform aggregations?