Came across this seemingly odd behaviour while discussing https://stackoverflow.com/a/47543066/9017455.

The OP had this dataframe:

import pandas as pd

x = pd.DataFrame.from_dict({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})

and wanted to find unique cat2 values for each group of cat1 values.

One option is to aggregate and use a lambda to create a set of unique values:

x.groupby('cat1').agg(lambda x: set(x))

# Returns
        cat2
cat1        
A     {X, Y}
B        {Y}
C     {Z, Y}

I assumed that passing set on its own would be equivalent to the lambda here, since set is also callable. However:

x.groupby('cat1').agg(set)

# Returns
              cat2
cat1              
A     {cat1, cat2}
B     {cat1, cat2}
C     {cat1, cat2}

I get the same behaviour as the lambda method if I define a proper function, and by doing that I can see that pandas calls that function with a Series. It appears that set is being called with a DataFrame, hence it returns the set of column names when iterating over the object.
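
The difference described above can be reproduced outside of agg. The following is a minimal sketch that only demonstrates how set behaves on a DataFrame versus a Series; it does not show which object pandas passes internally in any given version. The variable names columns_seen, values_seen, and per_group are made up for this illustration.

```python
import pandas as pd

x = pd.DataFrame({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})

# Iterating over a DataFrame yields its column labels (much like a dict
# yields its keys), so set() collects the column names.
columns_seen = set(x)

# Iterating over a Series yields its values, so set() collects the values.
values_seen = set(x['cat2'])

# Each sub-frame produced by iterating over the groupby still contains both
# columns, so calling set() on it gives the column names for every group.
per_group = {key: set(frame) for key, frame in x.groupby('cat1')}

print(columns_seen)
print(values_seen)
print(per_group)
```

The per-group result has the same shape as the unexpected output shown earlier: every group maps to the set of column names rather than to the values of cat2.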

This seems like inconsistent behaviour. Can anyone shed some light on why pandas treats the builtin types differently from ordinary functions?

Edit

Looking at how SeriesGroupBy.agg behaves might provide some more insight. Passing any type to this function results in an error "TypeError: 'type' object is not iterable".

x.groupby('cat1')['cat2'].agg(set)
Simon Bowly
  • Interestingly, no `TypeError` is raised if you map all `cat2` values to integers (like `x['cat2'] = x['cat2'].map({'X': 1, 'Y': 2, 'Z': 3})`). Both `x.groupby('cat1').agg(sum)` and `x.groupby('cat1').agg(lambda x: sum(x))` return the same result. I don't have an answer for you, but this is some other behavior I observed which makes me wonder about the `set` function itself and why it behaves in such a peculiar manner. – blacksite Nov 29 '17 at 01:08
  • @blacksite: It wouldn't be `set`, but rather `pandas` special-casing `set` somehow. `set` doesn't have any magic of its own that could make this happen. I'm having a hard time finding how it does this, but it clearly does (it's not built-ins in general BTW, if you pass it `functools.partial(set)`, which makes a straight wrapper of `set` with no actual changes, it behaves just like the `lambda`; I'm guessing somewhere in the code there is a place that recognizes the `set` constructor and optimizes it, possibly incorrectly in this case). – ShadowRanger Nov 29 '17 at 01:51
  • Weirdly, passing `frozenset` behaves like `set`, but the result displays as `tuple` literals, not `set` literals... – ShadowRanger Nov 29 '17 at 01:52
  • @ShadowRanger pandas seems to be special-casing any types passed to agg, and then handling them all differently (see update to my question). The code seems to recognise a type as distinct from a function, but then handles it incorrectly... – Simon Bowly Nov 29 '17 at 01:57
  • Does this help? https://stackoverflow.com/questions/37572611/pandas-groupby-and-make-set-of-items. If so, I will mark this as a duplicate. – BENY Nov 29 '17 at 02:12
  • It does, thanks! I think my confusion comes from pandas checking for `function` and not `callable` in these cases. – Simon Bowly Nov 29 '17 at 02:26
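
The distinction raised in the comments (a type versus a function versus any callable) can be checked with the standard library alone. This is only a sketch of how the three objects differ under common checks; which check pandas actually performed in the affected versions is not confirmed here. The names wrapped, as_lambda, and checks are made up for this illustration.

```python
import functools
import inspect

wrapped = functools.partial(set)   # thin wrapper around set, as in the comment
as_lambda = lambda s: set(s)       # the lambda formulation from the question

checks = {
    # All three can be called like functions...
    'set callable': callable(set),
    'partial callable': callable(wrapped),
    'lambda callable': callable(as_lambda),
    # ...but only set itself is a type (a class).
    'set is type': isinstance(set, type),
    'partial is type': isinstance(wrapped, type),
    'lambda is type': isinstance(as_lambda, type),
    # And only the lambda is a plain Python function object.
    'set is function': inspect.isfunction(set),
    'partial is function': inspect.isfunction(wrapped),
    'lambda is function': inspect.isfunction(as_lambda),
}

for name, value in checks.items():
    print(f'{name}: {value}')
```

Any code path that branches on "is this a type?" would therefore treat set differently from both the lambda and the partial wrapper, which is consistent with the observations in the comments above.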

1 Answer

This behaviour seems to have changed by now. At least here in version 0.23.0, both lambda x: set(x) and set behave identically:

In [6]: x.groupby('cat1').agg(set)
Out[6]:
        cat2
cat1
A     {Y, X}
B        {Y}
C     {Y, Z}

In [7]: x.groupby('cat1').agg(lambda x: set(x))
Out[7]:
        cat2
cat1
A     {Y, X}
B        {Y}
C     {Y, Z}

I could not positively identify the change, but bug #16405 looks suspiciously related (although the fix was already released with 0.20.2 in June 2017, long before this question...).

ojdo
  • Do you have any suggestions for using the `list` aggregation (instead of `set`), for which I get the same error ("TypeError: 'type' object is not iterable")? I am constrained to pandas 0.20.3. – Tanguy Jul 06 '20 at 12:46
  • @Tanguy: what have you tried? The lambda trick works for me, even in 0.20.3 (I checked). Simply replace `set` with a call to `list`. – ojdo Jul 06 '20 at 14:27
  • Switching to the lambda formulation worked with `set`, but not for `list` (using `lambda x: list(x)`), for which I got this error instead `ValueError: Function does not reduce` (but resorting to `set` was OK for me). – Tanguy Jul 09 '20 at 18:24
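
For collecting each group's values into a list, one commonly used alternative to agg is to select the column and use apply. The sketch below works on recent pandas versions; it has not been verified here against 0.20.3, so treat it as something to try rather than a confirmed fix for that version. The variable name lists is made up for this illustration.

```python
import pandas as pd

x = pd.DataFrame({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})

# Select the single column first, then build one list per group.
# apply does not require the function to reduce each group to a scalar.
lists = x.groupby('cat1')['cat2'].apply(list)

print(lists)
```

Unlike set, this keeps duplicates and the original row order within each group.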