I have noticed that in some cases with pandas 0.16.1, the apply() function on groupby() is being applied more than once to one or more of the output groups. Here is a reproduction:
In [1]:
from pandas import DataFrame, Series
df2 = DataFrame({"a": ["alpha", "alpha", "alpha", "beta", "beta", "beta", "beta", "gamma"]})
df2["b"] = Series(list(range(len(df2))))
df2
Out [1]:
       a  b
0  alpha  0
1  alpha  1
2  alpha  2
3   beta  3
4   beta  4
5   beta  5
6   beta  6
7  gamma  7
In [2]:
def my_func(df):
    print(df.index)
In [3]:
df2.groupby("a").apply(my_func)
Out [3]:
Int64Index([0, 1, 2], dtype='int64')
Int64Index([0, 1, 2], dtype='int64')
Int64Index([3, 4, 5, 6], dtype='int64')
Int64Index([7], dtype='int64')
Notice the [0, 1, 2] index appearing twice in the output. This would seem to indicate that the function was applied to the alpha group twice.
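To rule out that I am just misreading the printed output, the calls can also be tallied per group. This is only a rough sketch: `calls` and `counting_func` are names I made up for illustration, and the counts it reports may depend on the pandas version.

```python
from collections import Counter

import pandas as pd

df2 = pd.DataFrame({"a": ["alpha"] * 3 + ["beta"] * 4 + ["gamma"]})
df2["b"] = list(range(len(df2)))

# Hypothetical helper state: number of calls, keyed by the group's index labels.
calls = Counter()

def counting_func(group):
    # Record one call for the group identified by these index labels.
    calls[tuple(group.index.tolist())] += 1
    return len(group)

df2.groupby("a").apply(counting_func)

# Any group whose count is above 1 was passed to the function more than once.
for labels, n in sorted(calls.items()):
    print(labels, n)
```

A count above 1 for any set of labels would confirm that the function really ran more than once on that group.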
This is not a huge issue, since it's good practice for these functions to be idempotent in the first place. However, if the functions are costly in terms of runtime (think big regression runs, etc.), it can be more of a problem.
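For the costly case, one workaround I am considering is to loop over the groups myself instead of calling apply(). The sketch below assumes that iterating over the groupby object yields each (key, sub-frame) pair once; `costly_func`, `results`, and `summary` are placeholder names, and the sum stands in for the real expensive computation.

```python
import pandas as pd

df2 = pd.DataFrame({"a": ["alpha"] * 3 + ["beta"] * 4 + ["gamma"]})
df2["b"] = list(range(len(df2)))

def costly_func(group):
    # Stand-in for an expensive computation (hypothetical).
    return group["b"].sum()

# Explicit iteration visits each group once, so costly_func runs once per group.
results = {key: costly_func(group) for key, group in df2.groupby("a")}

# Collect the per-group results into a Series indexed by group key.
summary = pd.Series(results)
print(summary)
```

This gives up whatever result assembly apply() does on its own, so I am not sure it is the intended approach.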
Am I using the API incorrectly and/or misinterpreting this output, or is there a possible issue here?