A custom dask GroupBy
Aggregation
is very handy, but I am having trouble to define one working for the most often value in a column.
What do I have:
So from the example here, we can define custom aggregate functions like this:
custom_sum = dd.Aggregation('custom_sum', lambda s: s.sum(), lambda s0: s0.sum())
my_aggregate = {
'A': custom_sum,
'B': custom_most_often_value, ### <<< This is the goal.
'C': ['max','min','mean'],
'D': ['max','min','mean']
}
col_name = 'Z'
ddf_agg = ddf.groupby(col_name).agg(my_aggregate).compute()
While this works for custom_sum
(as on the example page), the adaption to most often value could be like this (from the example here):
custom_most_often_value = dd.Aggregation('custom_most_often_value', lambda x:x.value_counts().index[0], lambda x0:x0.value_counts().index[0])
but it yields
ValueError: Metadata inference failed in `_agg_finalize`.
You have supplied a custom function and Dask is unable to
determine the type of output that that function returns.
Then I tried to find the meta
keyword in the dd.Aggregation
implementation to define it, but could not find it.. And the fact that it is not needed in the example of custom_sum
makes me think that the error is somewhere else..
So my question would be, how to get that mostly occuring value of a column in a df.groupby(..).agg(..)
. Thanks!