
While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.

Data:

import pandas as pd

df = pd.DataFrame({
       'user_id': [1, 2, 3, 4, 1, 2, 3], 
       'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga', 
                      'Ju-jitsu','Krav Maga', 'Karate'], 
       'instructor': ['Bob', 'Alice','Bob', 'Alice','Alice', 'Alice','Bob']})

Demo:

In [36]: df.groupby('user_id').agg(lambda x: set(x))
Out[36]:
                    class_type    instructor
user_id
1        {Krav Maga, Ju-jitsu}  {Alice, Bob}
2            {Yoga, Krav Maga}       {Alice}
3           {Ju-jitsu, Karate}         {Bob}
4                  {Krav Maga}       {Alice}

In [37]: df.groupby('user_id').agg(set)
Out[37]:
                                class_type                         instructor
user_id
1        {user_id, class_type, instructor}  {user_id, class_type, instructor}
2        {user_id, class_type, instructor}  {user_id, class_type, instructor}
3        {user_id, class_type, instructor}  {user_id, class_type, instructor}
4        {user_id, class_type, instructor}  {user_id, class_type, instructor}

I would expect the same behaviour here - do you know what I am missing?

NOhs
MaxU - stand with Ukraine
  • Related: [Pandas groupby and make set of items](https://stackoverflow.com/questions/37572611/pandas-groupby-and-make-set-of-items) – jpp Mar 28 '18 at 14:40
  • I think this is because when you pass just `set`, it is called on the whole object, whose iterable in this case is the columns, hence this weird result. When you do this with a `lambda`, the set ctor is called on the series values – EdChum Mar 28 '18 at 14:43
  • @EdChum, thank you! It looks like it happens the way you have described it, but I don't get why `set` is applied to `df` instead of a single column... – MaxU - stand with Ukraine Mar 28 '18 at 14:49
  • @jpp, thank you for the link! – MaxU - stand with Ukraine Mar 28 '18 at 14:49
  • Give me 10 minutes and I should have a definitive answer, I'm in the middle of stepping through the source code – EdChum Mar 28 '18 at 15:33
  • @EdChum, sure, take your time... and thank you! :) – MaxU - stand with Ukraine Mar 28 '18 at 15:34
  • My findings so far: `.agg(set)` ends up calling `pd.core.groupby.NDFrameGroupBy._aggregate_generic`, whereas `.agg(lambda x: set(x))` ends up calling `pd.core.groupby._GroupBy._python_agg_general`. Both functions can be called with `set` or `lambda x: set(x)` (i.e. `._aggregate_generic(set)`/`._aggregate_generic(lambda x: set(x))` and `._python_agg_general(set)`/`._python_agg_general(lambda x: set(x))`), and each function produces the same result in both cases, but I haven't found out where/why the decision to call one or another is made. – jdehesa Mar 28 '18 at 15:36
  • I've spent 2 hours looking through `pandas` source code. Frustratingly opaque. Very few comments. I'll award a bounty to an answer which gets to the bottom of this one via the source code. – jpp Mar 28 '18 at 15:41
  • I've posted an answer which shows why calling `agg(set)` fails: basically, this situation isn't handled, whilst `list`, `dict`, and `tuple` are – EdChum Mar 28 '18 at 15:55

2 Answers

OK, what is happening here is that set isn't handled because it is not is_list_like in _aggregate:

elif is_list_like(arg) and arg not in compat.string_types:

see source

set isn't is_list_like, so this returns None up the call chain and ends up at this line:

results.append(colg.aggregate(a))

see source

This raises TypeError: 'type' object is not iterable, which leaves results empty, so this check then raises:

if not len(results):
    raise ValueError("no results")

see source

So, because we have no results, we end up calling _aggregate_generic:

see source

This then calls:

result[name] = self._try_cast(func(data, *args, **kwargs), data)

see source

This then ends up as:

(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)

(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}

I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
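The pdb result above can be reproduced without any pandas internals. This is only a simplified emulation of the per-group loop, not the actual `_aggregate_generic` source: it assumes the function is called once on each group's sub-DataFrame, which is what the debugger output suggests. The variable name `result` is just chosen to mirror the pdb session.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

# Iterating a GroupBy yields (group key, sub-DataFrame) pairs.
result = {}
for name, data in df.groupby('user_id'):
    # set() iterates the sub-DataFrame, and iterating a DataFrame
    # yields its column labels, not its rows or values.
    result[name] = set(data)

print(result)
```

Every group maps to the same set of column labels, which matches the dictionary shown in the debugger.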

So essentially, because set is treated here as neither a function nor a list-like, it collapses to calling the ctor on each group's DataFrame. Iterating a DataFrame yields its column labels, which is why every cell contains the column names. You can see the same effect here:

In [8]:

df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]: 
        class_type instructor
user_id                      
1             None       None
2             None       None
3             None       None
4             None       None

But when you use the lambda, which is an anonymous function, this works as expected.
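The difference comes down to what `set` gets to iterate over. A minimal sketch, independent of groupby (the names `frame_items` and `series_items` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

# Iterating a DataFrame yields column labels.
frame_items = set(df)

# Iterating a Series yields its values.
series_items = set(df['instructor'])

print(frame_items)
print(series_items)
```

So whichever code path hands `set` a DataFrame produces column names, and whichever hands it a Series produces the values you actually wanted.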

EdChum
  • @EdChum, bounty coming your way in a couple of days. A few comments in the `pandas` source would have been so helpful. Felt like a little merry-go-round. – jpp Mar 28 '18 at 16:12
  • Awesome work. Maybe consider replacing the first five links pointing to `master` with hardcoded tags/commits, so they still make sense in the future. I don't get the `is_aggregator` part, that seems to apply only when `isinstance(arg, dict)`? – jdehesa Mar 28 '18 at 16:19
  • So does this mean it's a bug? Or intended but undocumented behavior? – Aran-Fey Mar 28 '18 at 16:34
  • @Aran-Fey I think this is a bug; it should be posted as an issue on [github](https://github.com/pandas-dev/pandas/issues) – EdChum Mar 28 '18 at 18:46
  • @jdehesa at the beginning of that function it tries to establish if it's `is_aggregator`, then when it comes to the main body it checks if it's a `dict`, otherwise it executes the alternate branch. The reason is that the data may need to be coerced to be pandas friendly if the calling func is a `dict`. Here it fails on the `is_list_like` in the alternate branch, so this returns `None` and it gets handled higher up, but it doesn't really handle it as expected. Also the links point to the current released version, not sure if pointing at master is helpful for future readers – EdChum Mar 28 '18 at 18:57
  • @EdChum Thanks. What I meant is `is_aggregator` is a lambda which seems to be called only in the `isinstance(arg, dict)`... About the links, the first ones point to master, I was suggesting changing all to current release. – jdehesa Mar 28 '18 at 19:03
  • @jdehesa ah OK, I misunderstood, I think the `is_aggregator` is to handle if the values are of any of those types and if so to preserve ordering by constructing an orderedDict, I'll update my links to `0.22.0` – EdChum Mar 28 '18 at 19:12
  • @jdehesa I've edited out the first part as it's irrelevant as you've pointed out, I spent an hour stepping through using `pdb` before getting to the juicy bit and then re-pieced the call stack which is why that bit was in the original post of my answer – EdChum Mar 28 '18 at 19:38
  • @EdChum, can you share how you narrowed down to the exact execution steps? For example, were you using a library or IDE built-in (like VS Step In function for C coding?) that actually showed you which functions it called step by step? I did try to read the source code but there must be a better way. – B.Mr.W. Dec 20 '18 at 03:20
  • @B.Mr.W. I use ipython and pdb nothing fancy – EdChum Dec 20 '18 at 09:18

Perhaps, as @EdChum commented, agg applies Python built-in functions to each group as a mini DataFrame, whereas a user-defined function is applied to every column separately. Using print illustrates this:

df.groupby('user_id').agg(print,end='\n\n')

  class_type instructor  user_id
0  Krav Maga        Bob        1
4   Ju-jitsu      Alice        1

  class_type instructor  user_id
1       Yoga      Alice        2
5  Krav Maga      Alice        2

  class_type instructor  user_id
2   Ju-jitsu        Bob        3
6     Karate        Bob        3


df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))

0    Krav Maga
4     Ju-jitsu
Name: class_type, dtype: object

1         Yoga
5    Krav Maga
Name: class_type, dtype: object

2    Ju-jitsu
6      Karate
Name: class_type, dtype: object

3    Krav Maga
Name: class_type, dtype: object

...

I hope this explains why applying set gave the result shown above.
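If the aim is simply to avoid the ambiguity, one option is to select a single column before aggregating, so the function can only ever receive a Series. This is a sketch of that idiom (the one used in the related question linked in the comments), not a statement about how agg decides internally; `class_sets` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 1, 2, 3],
    'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                   'Ju-jitsu', 'Krav Maga', 'Karate'],
    'instructor': ['Bob', 'Alice', 'Bob', 'Alice', 'Alice', 'Alice', 'Bob']})

# Selecting one column gives a SeriesGroupBy, so set() is called
# on each group's Series of values rather than on a sub-DataFrame.
class_sets = df.groupby('user_id')['class_type'].apply(set)

print(class_sets)
```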

Bharath M Shetty