Duplicate group appearing in pandas DataFrameGroupBy.apply operation

Question

Consider a dataframe that looks like this:

>>> df
   A  B   C
0  1  4  10
1  2  5  11
2  3  6  12
3  1  7  13
4  2  8  14
5  3  9  15
6  1  4  16
7  2  5  17
8  3  6  18

Next, create a DataFrameGroupBy object by grouping the dataframe on the 'A' column with DataFrame.groupby. Finally, apply the following user-defined function to the DataFrameGroupBy object using the DataFrameGroupBy.apply method:

>>> def do_group_stuff(grp,grpname):
...     print(f"grpname: {grpname}")
...     grp.apply(lambda row: print(row),axis=1)
>>> df.groupby(['A']).apply(lambda grp: do_group_stuff(grp,grp.name))

I expect the DataFrameGroupBy object to contain three groups, one for each of the three values in the 'A' column of df, and the output to look something like this:

grpname: 1
A     1
B     4
C    10
Name: 0, dtype: int64
A     1
B     7
C    13
Name: 3, dtype: int64
A     1
B     4
C    16
Name: 6, dtype: int64
grpname: 2
A     2
B     5
C    11
Name: 1, dtype: int64
A     2
B     8
C    14
Name: 4, dtype: int64
A     2
B     5
C    17
Name: 7, dtype: int64
grpname: 3
A     3
B     6
C    12
Name: 2, dtype: int64
A     3
B     9
C    15
Name: 5, dtype: int64
A     3
B     6
C    18
Name: 8, dtype: int64

But in reality the output looks like this:

grpname: 1
A     1
B     4
C    10
Name: 0, dtype: int64
A     1
B     7
C    13
Name: 3, dtype: int64
A     1
B     4
C    16
Name: 6, dtype: int64
grpname: 1
A     1
B     4
C    10
Name: 0, dtype: int64
A     1
B     7
C    13
Name: 3, dtype: int64
A     1
B     4
C    16
Name: 6, dtype: int64
grpname: 2
A     2
B     5
C    11
Name: 1, dtype: int64
A     2
B     8
C    14
Name: 4, dtype: int64
A     2
B     5
C    17
Name: 7, dtype: int64
grpname: 3
A     3
B     6
C    12
Name: 2, dtype: int64
A     3
B     9
C    15
Name: 5, dtype: int64
A     3
B     6
C    18
Name: 8, dtype: int64

where the "1" group appears twice for some reason. Any ideas why this is the case?

I think you are seeing the optimization pass. This first pass is not recorded in the returned dataset, but the function is called, so prints show up twice. https://stackoverflow.com/a/25716383/6361531 — Scott Boston, Aug 01 '19 at 02:11
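If the extra call described in the comment above is the cause, one way to avoid it is to skip apply for side-effect-only work and iterate over the groups directly. The following is a minimal sketch, assuming df can be rebuilt from the printed frame in the question. The list `seen` is a hypothetical helper added only to record which groups were visited, and grouping by the scalar key "A" (rather than the list ['A']) is a choice made here so that the group name is a plain value.

```python
import pandas as pd

# Rebuild the frame shown in the question (values copied from the printed df).
df = pd.DataFrame({
    "A": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "B": [4, 5, 6, 7, 8, 9, 4, 5, 6],
    "C": [10, 11, 12, 13, 14, 15, 16, 17, 18],
})

seen = []  # hypothetical helper: records each group name as it is visited

# Iterating over the GroupBy object yields each (name, sub-frame) pair once,
# so the prints are not affected by any extra call that apply may make.
for grpname, grp in df.groupby("A"):
    seen.append(grpname)
    print(f"grpname: {grpname}")
    for _, row in grp.iterrows():
        print(row)
```

With this loop each group header is printed once. The apply method is still the better fit when the function returns a value to be combined across groups. In that case it is safer to keep the function free of side effects such as print.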
