Came across this seemingly odd behaviour while discussing https://stackoverflow.com/a/47543066/9017455.
The OP had this dataframe:
import pandas as pd

x = pd.DataFrame.from_dict({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z']})
and wanted to find the unique cat2 values for each group of cat1 values.
One option is to aggregate and use a lambda to create a set of unique values:
x.groupby('cat1').agg(lambda x: set(x))
# Returns
        cat2
cat1
A     {X, Y}
B        {Y}
C     {Z, Y}
I assumed that using set on its own would be equivalent to the lambda here, since it is callable. However:
x.groupby('cat1').agg(set)
# Returns
              cat2
cat1
A     {cat1, cat2}
B     {cat1, cat2}
C     {cat1, cat2}
I get the same behaviour as the lambda method if I define a proper function, and by doing that I can see that pandas calls that function with a Series. It appears that set, by contrast, is being called with a DataFrame. Iterating over a DataFrame yields its column labels, which is why the result is the set of column names.
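To make the difference concrete, here is a minimal sketch that reuses the dataframe from the question. It shows what set produces when given a whole DataFrame versus a single Series, and uses a named function to record the type of object pandas passes in. The names frame_items, series_items, seen_types and unique_values are hypothetical and exist only for this illustration. What agg(set) itself returns may differ between pandas versions, so the sketch does not rely on it.

```python
import pandas as pd

x = pd.DataFrame({
    'cat1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'cat2': ['X', 'X', 'Y', 'Y', 'Y', 'Y', 'Z', 'Z'],
})

# Iterating over a DataFrame yields its column labels, not its rows.
frame_items = set(x)

# Iterating over a Series yields its values.
series_items = set(x['cat2'])

# Hypothetical helper state: records what pandas hands to the function.
seen_types = []

def unique_values(group):
    """Record the argument's type name, then collect its unique items."""
    seen_types.append(type(group).__name__)
    return set(group)

result = x.groupby('cat1').agg(unique_values)

print(frame_items)      # the column labels
print(series_items)     # the distinct values of cat2
print(set(seen_types))  # the type(s) pandas passed to unique_values
print(result)
```

If set is handed the whole group as a DataFrame, the first case applies and the column names come back. A named function or lambda that receives one column at a time falls into the second case.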
This seems like inconsistent behaviour. Can anyone shed some light on why pandas treats built-in types such as set differently from a lambda or a user-defined function?
Edit
Looking at how SeriesGroupBy.agg behaves might provide some more insight. Passing any type to this function raises "TypeError: 'type' object is not iterable":
x.groupby('cat1')['cat2'].agg(set)