I have a couple of questions about using groupby on dask dataframes. As I understand it, iterating over a groupby result the way one does in Pandas doesn't work in dask, i.e.

for name, group in grouped:
    logger.info((name, group))

isn't allowed. We're supposed to use apply instead. However, in Pandas if I wanted to find out the number of groups I could do the following:

 len(grouped.groups)

By using apply, I would expect to be able to do this for a groupby on a dask dataframe:

 d_grouped.apply(len)

But that doesn't work. How can I find out the number of groups resulting from a groupby on a dask dataframe?

femibyte
  • isn't the len of groups equivalent to the len of unique values in the resulting index? Just trying to find an alternative – Zeugma Nov 29 '16 at 03:08
  • Without knowing further details of your work, I am just making a guess here. Dask parallelizes your `apply` function on each group across multiple cores. You can achieve something similar using [this answer](http://stackoverflow.com/a/29281494/3765319). In this case, you can use the native capabilities and attributes of pandas `groupby` object, while applying your function in parallel. Note that this is not free of caveats. – Kartik Nov 29 '16 at 04:43