I have the following dataset:

location  category    percent
A         5           100.0
B         3           100.0
C         2            50.0
          4            13.0
D         2            75.0
          3            59.0
          4            13.0
          5             4.0

And I'm trying to get the n largest rows (by percent) for each group when the dataframe is grouped by location. That is, if I want the top 2 largest percentages for each group, the output should be:

location  category    percent
A         5           100.0
B         3           100.0
C         2            50.0
          4            13.0
D         2            75.0
          3            59.0

It looks like in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask doesn't have an nlargest function for groupby. I've been playing around with apply but can't seem to get it to work properly:

df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()

But I just get the error `ValueError: Wrong number of items passed 0, placement implies 8`.

whisperstream

1 Answer

The apply should work, but your syntax is a little off:

In [11]: df
Out[11]:
Dask DataFrame Structure:
              Unnamed: 0 location category  percent
npartitions=1
                   int64   object    int64  float64
                     ...      ...      ...      ...
Dask Name: from-delayed, 3 tasks

In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: x, dtype: float64
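
For completeness, here is a minimal sketch of how a comparable Dask DataFrame can be built from pandas (the rows are taken from the question; npartitions=1 matches the structure shown above):

import pandas as pd
import dask.dataframe as dd

# Build the example frame from the question, then wrap it in dask
pdf = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})
df = dd.from_pandas(pdf, npartitions=1)

# meta=('x', 'f8') tells dask the name and dtype of the series the lambda returns
result = df.groupby("location")["percent"].apply(
    lambda x: x.nlargest(2), meta=("x", "f8")
).compute()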

In pandas you'd have .nlargest and .rank as groupby methods which would let you do this without the apply:

In [21]: df1
Out[21]:
  location  category  percent
0        A         5    100.0
1        B         3    100.0
2        C         2     50.0
3        C         4     13.0
4        D         2     75.0
5        D         3     59.0
6        D         4     13.0
7        D         5      4.0

In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: percent, dtype: float64
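
If you want the result as ordinary rows rather than a reindexed series, a rank-based filter is one pandas equivalent (a sketch; method="first" breaks ties by order of appearance):

# Keep the two largest percents per location, preserving all columns
top2 = df1[df1.groupby("location")["percent"]
              .rank(method="first", ascending=False) <= 2]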

The dask documentation notes:

Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:

  1. The pandas API is huge
  2. Some operations are genuinely hard to do in parallel (for example sort).
Andy Hayden
  • It could be that rank/cumcount would sit under "cleverly parallelizable", so it might be worth making a feature request to dask to support these on the groupby. I think maybe I implemented a faster path for nlargest in pandas (previously you had to use apply or slice on the rank). – Andy Hayden Nov 10 '17 at 17:30
  • Now I think about it, nlargest is much easier to parallelize than rank (especially if n is small), so that'll be why rank is not implemented in dask. I still think this could be a good dask feature request... – Andy Hayden Nov 10 '17 at 17:36
  • Thanks, to get category back in the final dataframe, would I merge results on the "location" column? – whisperstream Nov 10 '17 at 17:50
  • @whisperstream something like this? `df.groupby('location')[["category", "percent"]].apply(lambda x: x.nlargest(2, columns=["percent"]), meta={"category": 'f8', "percent": 'f8'}).compute()` – Andy Hayden Nov 10 '17 at 18:13
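
Spelled out, the snippet from the comment above would look something like this (assuming the same df as before; the dict-style meta describes the columns of the frame the lambda returns, and DataFrame.nlargest keeps whole rows, so category comes back without a merge):

# nlargest on a DataFrame keeps whole rows, so category survives
top2 = df.groupby("location")[["category", "percent"]].apply(
    lambda x: x.nlargest(2, columns=["percent"]),
    meta={"category": "f8", "percent": "f8"},
).compute()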