Why does the groupby function return duplicated data?

Question

I am testing the pandas groupby function and have generated a random DataFrame:

df = pd.DataFrame(np.random.randint(5,size=(6,3)), columns=list('abc'))

In one random case, df is:

When I use the following code to print out the groupby object, I get some interesting results.

def func(x):
    print(x)
df.groupby("a").apply(lambda x: func(x))

   a  b  c
0  0  1  4
   a  b  c
0  0  1  4
   a  b  c
2  2  4  1
3  2  2  1
   a  b  c
1  4  0  0
4  4  4  3

Could anybody let me know why index 0 appears twice in this case?

To avoid this behavior, and also to prevent having an output of `None`, you can also just iterate over the groups `for idx,grp in df.groupby('a'): print(grp)` — G. Anderson, Jul 23 '19 at 17:48
Yes, that could work. I just have a very big table and I think .apply iterates faster than a for loop — Y. Peng, Jul 24 '19 at 18:05

score 2 · Accepted Answer · answered Jul 23 '19 at 17:42


DataFrame.groupby.apply evaluates the first group twice to determine whether a fast path for calculation can be followed for the remaining groups. This behavior has changed in recent versions of pandas as discussed here
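
One way to check what an installed pandas version does is to record the calls instead of printing them. This is only a rough sketch: `seen` and `record` are names made up for illustration, the seed is arbitrary, and the number of calls you get depends on your pandas version.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; the seed is an arbitrary choice.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(5, size=(6, 3)), columns=list("abc"))

seen = []  # hypothetical helper list: one entry per call made by apply


def record(group):
    # Store the row labels of the group instead of printing it.
    seen.append(tuple(group.index))
    return len(group)


# Selecting the non-key columns keeps the example independent of how a
# given pandas version treats the grouping column inside apply.
df.groupby("a")[["b", "c"]].apply(record)

n_groups = df["a"].nunique()
print(f"{len(seen)} calls for {n_groups} groups")
print("groups seen more than once:", len(seen) - len(set(seen)))
```

If the first group is evaluated twice, the call count is one higher than the number of groups and the last line reports 1. Either way the result returned by apply is unaffected, so the duplicate only matters when the function has side effects such as printing or appending to a list.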


jeschwar


Interesting... Thanks a lot for the answer – Y. Peng Jul 24 '19 at 18:03
@Y.Peng you're welcome; please mark this as the *accepted answer* if you feel this answers your question – jeschwar Jul 24 '19 at 20:55
