
I have noticed that in some cases with pandas 0.16.1, the function passed to groupby().apply() is called more than once on one of the groups. Here is a reproduction:

In [1]: 
from pandas import DataFrame, Series
df2 = DataFrame({"a": ["alpha", "alpha", "alpha", "beta", "beta", "beta", "beta", "gamma"]})
df2["b"] = Series(list(range(len(df2))))
df2

Out [1]:
       a  b
0  alpha  0
1  alpha  1
2  alpha  2
3   beta  3
4   beta  4
5   beta  5
6   beta  6
7  gamma  7

In [2]: 
def my_func(df):
    print(df.index)

In [3]: 
df2.groupby("a").apply(my_func)

Out [3]:
Int64Index([0, 1, 2], dtype='int64')
Int64Index([0, 1, 2], dtype='int64')
Int64Index([3, 4, 5, 6], dtype='int64')
Int64Index([7], dtype='int64')

Notice the [0,1,2] index appearing twice in the output. This would seem to indicate that the function was applied to the alpha group twice.

This is not a huge issue, since it's good practice for these functions to be idempotent in the first place. However, if the functions are costly in terms of runtime (think big regression runs, etc.), it can be more of a problem.

Am I using the API incorrectly and/or misinterpreting this output, or is there a possible issue here?

sparc_spread

2 Answers


According to the docs for DataFrame.apply (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html):

In the current implementation apply calls func twice on the first column/row to decide whether it can take a fast or slow code path.

stellasia

It is documented behavior:

In the current implementation apply calls func twice on the first group to decide whether it can take a fast or slow code path. This can lead to unexpected behavior if func has side-effects, as they will take effect twice for the first group.
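
If the function is expensive or has side effects, one way to sidestep the extra call is to iterate over the groups yourself instead of going through apply(). The following is only a minimal sketch: costly_func and the calls list are hypothetical names introduced for illustration, standing in for whatever heavy computation you actually run.

```python
import pandas as pd

df2 = pd.DataFrame(
    {"a": ["alpha", "alpha", "alpha", "beta", "beta", "beta", "beta", "gamma"]}
)
df2["b"] = list(range(len(df2)))

calls = []  # hypothetical bookkeeping: records which group each call received


def costly_func(group):
    # Hypothetical stand-in for an expensive computation (e.g. a regression).
    calls.append(group["a"].iloc[0])
    return group["b"].sum()


# Iterating over the GroupBy object yields each (key, sub-frame) pair once,
# so costly_func runs exactly one time per group.
results = {key: costly_func(group) for key, group in df2.groupby("a")}

# Collect the per-group results into a Series indexed by group key.
result_series = pd.Series(results, name="b_sum")
print(result_series)
print(calls)
```

Here calls ends up with one entry per group, so nothing is computed twice. The trade-off is that you assemble the result yourself rather than letting apply() combine the pieces for you.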

BrenBarn