
In pandas.DataFrame.groupby, there is an argument group_keys, which I gather is supposed to do something relating to how group keys are included in the dataframe subsets. According to the documentation:

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

However, I can't really find any examples where group_keys makes an actual difference:

import pandas as pd

df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list('xyz'))

gby = df.groupby('x')
gby_k = df.groupby('x', group_keys=False)

It doesn't make a difference in the output of apply:

ap = gby.apply(pd.DataFrame.sum)
#    x  y  z
# x         
# 0  0  1  3
# 2  4  4  3
# 3  6  1  1

ap_k = gby_k.apply(pd.DataFrame.sum)
#    x  y  z
# x         
# 0  0  1  3
# 2  4  4  3
# 3  6  1  1

And even if you print out the grouped subsets as you go, the results are still identical:

def printer_func(x):
    print(x)
    return x

print('gby')
print('--------------')
gby.apply(printer_func)
print('--------------')

print('gby_k')
print('--------------')
gby_k.apply(printer_func)
print('--------------')

# gby
# --------------
#    x  y  z
# 0  0  1  3
#    x  y  z
# 0  0  1  3
#    x  y  z
# 3  2  3  3
# 4  2  1  0
#    x  y  z
# 1  3  1  1
# 2  3  0  0
# --------------
# gby_k
# --------------
#    x  y  z
# 0  0  1  3
#    x  y  z
# 0  0  1  3
#    x  y  z
# 3  2  3  3
# 4  2  1  0
#    x  y  z
# 1  3  1  1
# 2  3  0  0
# --------------

I considered the possibility that the default argument is actually True, but switching group_keys to explicitly False doesn't make a difference either. What exactly is this argument for?

(Run on pandas version 0.18.1)

Edit: I did find a way where group_keys changes behavior, based on this answer:

import pandas as pd
import numpy as np

row_idx = pd.MultiIndex.from_product(((0, 1), (2, 3, 4)))
d = pd.DataFrame([[4, 3], [1, 3], [1, 1], [2, 4], [0, 1], [4, 2]], index=row_idx)

df_n = d.groupby(level=0).apply(lambda x: x.nlargest(2, [0]))
#        0  1
# 0 0 2  4  3
#     3  1  3
# 1 1 4  4  2
#     2  2  4

df_k = d.groupby(level=0, group_keys=False).apply(lambda x: x.nlargest(2, [0]))

#      0  1
# 0 2  4  3
#   3  1  3
# 1 4  4  2
#   2  2  4

However, I'm still not clear on the intelligible principle behind what group_keys is supposed to do. This behavior does not seem intuitive based on @piRSquared's answer.

Paul

4 Answers


The group_keys parameter of groupby comes in handy during apply operations. With group_keys=True, the result gains an additional index level corresponding to the grouped columns; with group_keys=False, that level is omitted. The difference shows up especially when the applied function operates on individual columns.

One such instance:

In [21]: gby = df.groupby('x',group_keys=True).apply(lambda row: row['x'])

In [22]: gby
Out[22]: 
x   
0  0    0
2  3    2
   4    2
3  1    3
   2    3
Name: x, dtype: int64

In [23]: gby_k = df.groupby('x', group_keys=False).apply(lambda row: row['x'])

In [24]: gby_k
Out[24]: 
0    0
3    2
4    2
1    3
2    3
Name: x, dtype: int64

One of its intended applications could be grouping again by one level of the resulting hierarchy, since group_keys=True turns the result into a MultiIndex object.

In [27]: gby.groupby(level='x').sum()
Out[27]: 
x
0    0
2    4
3    6
Name: x, dtype: int64
Michel de Ruiter
Nickil Maveli
    Hmmm.. I still feel like I don't have a sense of what `group_keys` is intended to do here. Like... why does it have this specific behavior, **only** when you have grouped columns? Seems like it only creates a multi-index when the `apply` function returns a `Series`, but I don't understand why. – Paul Aug 09 '16 at 19:45

If you are passing a function that preserves an index, pandas tries to keep that information. But if you pass a function that removes all semblance of index information, group_keys=True allows you to keep that information.

Use this instead

f = lambda df: df.reset_index(drop=True)

Then compare the two groupby objects:

gby.apply(lambda df: df.reset_index(drop=True))


gby_k.apply(lambda df: df.reset_index(drop=True))

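A minimal sketch of the comparison, rebuilding the question's df. Selecting the y and z columns before apply is an assumption added here so that the grouping column is never passed to the function; it is not part of the calls above. The names with_keys and without_keys are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list('xyz'))


def drop_labels(group):
    # Discards each group's original row labels, so the result
    # carries no trace of where the rows came from.
    return group.reset_index(drop=True)


# Assumption: [['y', 'z']] keeps the grouping column out of the applied function.
with_keys = df.groupby('x', group_keys=True)[['y', 'z']].apply(drop_labels)
without_keys = df.groupby('x', group_keys=False)[['y', 'z']].apply(drop_labels)

# With group_keys=True the index has two levels: the group key, then the new 0..n-1 label.
print(with_keys.index.tolist())
# With group_keys=False only the new labels remain, so they repeat across groups.
print(without_keys.index.tolist())
```

In the first result each row can still be traced back to its group through the outer index level; in the second, the repeated labels no longer say which group a row belongs to.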

piRSquared
  • Thanks for this! Is this the full extent of what `group_keys` does? I've edited the question with another example of where `group_keys` does something, but it does not seem consistent with the meaning of `group_keys` you've articulated here. – Paul Aug 09 '16 at 19:39

It confused me as well. Here are some "notes to self" that might help others.

The only group_keys difference is in the output of apply (if it is so-called 'transform-like', that is).

The input to the passed function does not change: its index always includes the group keys! One can .reset_index(group_key_levels_to_drop, drop=True) if needed.

By default, currently (as of pandas version 1.5.3) group keys are not prepended to the index of DataFrame results. In the future they will be (as already happens for Series results). Due to this upcoming change in default behavior, not specifying an explicit group_keys= for DataFrame results currently shows a FutureWarning:

Not prepending group keys to the result index of transform-like apply. ...
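As a rough illustration, passing group_keys explicitly makes the choice visible for a transform-like function. This sketch assumes pandas 1.5 or later and reuses the question's df; the y/z column selection and the names add_one, prepended and plain are my own additions.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list('xyz'))


def add_one(group):
    # Transform-like: the result has exactly the same index as the input group.
    return group + 1


# Assumption: pandas >= 1.5, where an explicit group_keys is honored here.
prepended = df.groupby('x', group_keys=True)[['y', 'z']].apply(add_one)
plain = df.groupby('x', group_keys=False)[['y', 'z']].apply(add_one)

# group_keys=True: (group key, original label) pairs, ordered by group.
print(prepended.index.tolist())
# group_keys=False: the original labels, in their original order.
print(plain.index.tolist())
```

Stating group_keys explicitly, whichever value is wanted, keeps the result independent of the default in use.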

Aside: if group_keys=True (or by default for Series results), also including as_index=False causes the prepended index to be group index numbers (0, 1, ...). Apart from backwards compatibility with versions that had this behavior by default, I cannot think of any reason to do that. The same holds for the obsolete squeeze=True to convert a one-column DataFrame result into a Series.

Michel de Ruiter

The documentation is convoluted. The answer is simple (it applies only to groupby followed by apply):

Condition 1: the result has the same length as the original df.

    a) If the result is ordered by group, group_keys=True adds the group key.
    Example: df.groupby(...).apply(lambda df: df[0] + df[1]) # results are ordered by their specific group
    b) If the result is ordered by the original index, the library has no need to add the group key, since the original order is retained.
    Example: df.groupby(...).apply(lambda df: df + 1) # results are in the original order

Condition 2: the result's length differs from the original length.

    The group key is always included.
    Example: df.groupby(...).apply(lambda x: x.mean()) # result length is reduced, so group_keys has no effect
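The reducing case can be checked with a short sketch against the question's df. The y/z column selection and the variable names are assumptions made for this illustration only.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list('xyz'))

# A reducing function: each group collapses to a single row of column means.
reduced_keys = df.groupby('x', group_keys=True)[['y', 'z']].apply(lambda g: g.mean())
reduced_no_keys = df.groupby('x', group_keys=False)[['y', 'z']].apply(lambda g: g.mean())

# Both results are indexed by the group key, whatever group_keys is set to.
print(reduced_keys)
print(reduced_keys.equals(reduced_no_keys))
```

This matches what the question observed with pd.DataFrame.sum: a per-group reduction is labeled by the group key either way.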
Michel de Ruiter