This question is a follow-up to "Python pandas groupby object apply method duplicates first group".

I understand that answer, and tried some experiments of my own, e.g.:

import pandas as pd
from cStringIO import StringIO
s = '''c1 c2 c3
1 2 3
4 5 6'''
df = pd.read_csv(StringIO(s), sep=' ')
print df
def f2(df):
    print df.iloc[:]
    print "--------"
    return df.iloc[:]
df2 = df.groupby(['c1']).apply(f2)
print "======"
print df2

This gives the expected output:

   c1  c2  c3
0   1   2   3
1   4   5   6
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
1   4   5   6
--------
======
   c1  c2  c3
0   1   2   3
1   4   5   6

However, when I return df.iloc[0:] instead of df.iloc[:]:

def f3(df):
    print df.iloc[0:]
    print "--------"
    return df.iloc[0:]
df3 = df.groupby(['c1']).apply(f3)
print "======"
print df3

I get an additional index level:

   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
0   1   2   3
--------
   c1  c2  c3
1   4   5   6
--------
======
      c1  c2  c3
c1              
1  0   1   2   3
4  1   4   5   6

I did some searching and suspect that a different code path is taken in the second case. Is that what is happening?

ntg

1 Answer

The difference is that iloc[:] returns the object itself, while iloc[0:] returns a view of the object. Take a look at this:

>>> df.iloc[:] is df
True

>>> df.iloc[0:] is df
False

Where this makes a difference is that within the groupby, each group has a name attribute that reflects the grouping. When your function returns an object with this name attribute, no index is added to the result, while if you return an object without this name attribute, an index is added to track which group each came from.
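To see the name attribute for yourself, here is a minimal sketch. The helper record_name and the list seen are names made up for illustration, and the details of what apply returns may differ between pandas versions, so this only checks which names the groups carry:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 4], "c2": [2, 5], "c3": [3, 6]})

seen = []  # group names observed inside apply (hypothetical helper state)

def record_name(group):
    # Inside apply, each group carries the grouping key as .name
    seen.append(group.name)
    return group

df.groupby("c1").apply(record_name)

# Some pandas versions call the function twice on the first group,
# so compare as a set rather than as a list.
print(set(seen))

# The original frame itself has no such attribute.
print(hasattr(df, "name"))
```

The set comparison is deliberate: the duplicate call on the first group discussed in the linked question would otherwise show up as a repeated entry.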

Interestingly, you can force the iloc[:] behavior for iloc[0:] by explicitly setting the name attribute of the group before returning:

def f(x):
    out = x.iloc[0:]
    out.name = x.name
    return out

df.groupby('c1').apply(f)
#    c1  c2  c3
# 0   1   2   3
# 1   4   5   6

My guess is that the no-index behavior with named output is basically a special case meant to make df.groupby(col).apply(lambda x: x) be a no-op.

jakevdp
  • Seems exactly right (also tried out= x.iloc[0:1]; out.name = x.name , and got the extra index). Also, cool video on the Scikit-Learn, you rock :) – ntg Nov 05 '15 at 14:53
  • Also tried out= x.iloc[0:1]; out.name = x.name , and got the extra index, but only if the returned result would differ when there are duplicate c1 values. – ntg Nov 05 '15 at 15:07