
I am testing the pandas groupby function and have generated a random DataFrame:

df = pd.DataFrame(np.random.randint(5,size=(6,3)), columns=list('abc'))

In one random run, df is:

   a  b  c
0  2  2  2
1  1  4  2
2  3  0  1
3  2  1  3
4  0  2  2
5  2  1  4

When I use the following code to print each group of the groupby object, I get some interesting results:

def func(x):
    print(x)
df.groupby("a").apply(lambda x: func(x))

   a  b  c
0  0  1  4
   a  b  c
0  0  1  4
   a  b  c
2  2  4  1
3  2  2  1
   a  b  c
1  4  0  0
4  4  4  3

Could anybody let me know why index 0 appears twice in this case?

cs95
Y. Peng
  • To avoid this behavior, and also to prevent having an output of `None`, you can also just iterate over the groups `for idx,grp in df.groupby('a'): print(grp)` – G. Anderson Jul 23 '19 at 17:48
  • Yes, that could work. I just have a very big table and I think .apply iterates faster than a for loop – Y. Peng Jul 24 '19 at 18:05

1 Answer


DataFrame.groupby(...).apply evaluates the first group twice to determine whether a fast path for the calculation can be used for the remaining groups, which is why the group containing index 0 is printed twice. This behavior has changed in recent versions of pandas, as discussed here.
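A minimal sketch for checking this on your own installation: it records the row index of every group the function receives and compares that with plain iteration over the groups. The frame below is a hypothetical five-row example chosen to mirror the groups printed in the question, and `record` is an illustrative helper name. On older pandas I would expect the first group to show up twice in `calls`; my understanding is that newer releases (0.25.0 onward, as far as I know) call the function once per group. Newer versions may also emit a deprecation warning about grouping columns, which does not affect the result here.

```python
import pandas as pd

# Hypothetical frame that reproduces the groups shown in the question.
df = pd.DataFrame(
    {"a": [0, 4, 2, 2, 4], "b": [1, 0, 4, 2, 4], "c": [4, 0, 1, 1, 3]}
)

calls = []  # row index of each group, in the order apply passes them in


def record(group):
    # Side effect only: remember which rows this call received.
    calls.append(group.index.tolist())
    return len(group)


df.groupby("a").apply(record)

# Plain iteration visits every group exactly once, on any pandas version.
seen_by_loop = [group.index.tolist() for _, group in df.groupby("a")]

print("apply calls:", calls)
print("loop groups:", seen_by_loop)
```

If the function has side effects (printing, appending, writing files), iterating over the groups as suggested in the comments avoids depending on how many times apply decides to call it.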

jeschwar