
When and why is the sort flag of a DataFrame grouping ignored in pd.GroupBy.apply()? The problem is best understood with an example. Of the following four equivalent solutions to a dummy problem, approaches 1 and 4 honor the sort flag, while approaches 2 and 3 ignore it.

import pandas as pd
import numpy as np 

#################################################
# Construct input data:
cats = list("bcabca")
vals = np.arange(0, 10 * len(cats), 10)
df = pd.DataFrame({"i": cats, "ii": vals})

# df:
#      i  ii
#   0  b   0
#   1  c  10
#   2  a  20
#   3  b  30
#   4  c  40
#   5  a  50

# Groupby with sort=True
g = df.groupby("i", sort=True)

#################################################
# 1) This correctly returns a sorted series
ret1 = g.apply(lambda df: df["ii"]+1)

# ret1:
#   i
#   a  2    21
#      5    51
#   b  0     1
#      3    31
#   c  1    11
#      4    41

#################################################
# 2) This ignores the sort flag
ret2 = g.apply(lambda df: df[["ii"]]+1)

# ret2:
#      ii
#   0   1
#   1  11
#   2  21
#   3  31
#   4  41
#   5  51

#################################################
# 3) This also ignores the sort flag.
def fun(df):
    # Note: this adds the column to the group passed in, i.e. it mutates its argument.
    df["iii"] = df["ii"] + 1
    return df

ret3 = g.apply(fun)

# ret3:
#      i  ii  iii
#   0  b   0    1
#   1  c  10   11
#   2  a  20   21
#   3  b  30   31
#   4  c  40   41
#   5  a  50   51

#################################################
# 4) This, however, respects the sort flag again:
ret4 = {}
for key, dfg in g:
    ret4[key] = fun(dfg)
ret4 = pd.concat(ret4, axis=0)

# ret4:
#        i  ii  iii
#   a 2  a  20   21
#     5  a  50   51
#   b 0  b   0    1
#     3  b  30   31
#   c 1  c  10   11
#     4  c  40   41

Is this a design flaw in pandas? Or is this behavior intentional? From the documentation of pd.DataFrame.groupby() and pd.GroupBy.apply(), I would expect solutions 2 and 3 to also take the sort flag into account. Why would they not?

(The problem was reproduced with pandas 1.2.4 and 1.4.0)


Update: A workaround for approaches 2 and 3 is to sort the DataFrame by the grouping key before grouping. Source of inspiration: the GitHub issue linked in the comments.

# Approach 2:
df.sort_values("i").groupby("i").apply(lambda df: df[["ii"]]+1)
# Approach 3:
df.sort_values("i").groupby("i").apply(fun)
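One detail worth checking in this workaround: `sort_values` uses quicksort by default, which is not guaranteed to be stable, so rows within a group could in principle change their relative order. A minimal sketch, assuming the original within-group order matters, passes `kind="mergesort"` (a stable sort). The names `df_sorted` and `ret2_sorted` are illustrative and not part of the original post.

```python
import numpy as np
import pandas as pd

cats = list("bcabca")
df = pd.DataFrame({"i": cats, "ii": np.arange(0, 10 * len(cats), 10)})

# A stable sort keeps the original relative order of rows within each key.
df_sorted = df.sort_values("i", kind="mergesort")

# Same call as in the workaround for approach 2, on the stably sorted frame.
ret2_sorted = df_sorted.groupby("i").apply(lambda group: group[["ii"]] + 1)

print(df_sorted.index.tolist())    # [2, 5, 0, 3, 1, 4]
print(ret2_sorted["ii"].tolist())  # [21, 51, 1, 31, 11, 41]
```

The sketch inspects only the values, since the index layout of the result may differ between pandas versions.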
normanius
  • I am honestly baffled by this behavior too, but you might consider poking around [this thread](https://github.com/pandas-dev/pandas/issues/15947) in the pandas-dev issues. – Derek O Jan 25 '22 at 03:25
  • @DerekO Thanks. This led me at least to a workaround. Not sure if this is performant, though. – normanius Jan 25 '22 at 04:09
  • If you make a copy of the DataFrames in the two failing examples before proceeding (to avoid mutation), does it change anything? – sammywemmy Jan 25 '22 at 19:59
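The copy suggestion in the last comment can be tried with a non-mutating variant of `fun`. This is only a sketch: `fun_copy` and `ret3_copy` are hypothetical names, and no particular outcome is claimed here, since the row order of the result is exactly what is in question.

```python
import numpy as np
import pandas as pd

cats = list("bcabca")
df = pd.DataFrame({"i": cats, "ii": np.arange(0, 10 * len(cats), 10)})
g = df.groupby("i", sort=True)

def fun_copy(group):
    # Work on a copy so the group passed in by pandas is left untouched.
    out = group.copy()
    out["iii"] = out["ii"] + 1
    return out

ret3_copy = g.apply(fun_copy)

# Inspect the row order to see whether the sort flag was honored.
print(ret3_copy)
# The input frame itself gains no new column.
print(list(df.columns))
```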

1 Answer


I wasn't sure whether to post this as an answer or comment since it's a guess, but I think that if you omit the column that you are sorting by in your operation after the groupby, then pandas no longer "understands" to sort by that column.

In example 2), ret2 = g.apply(lambda df: df[["ii"]]+1) means that in your lambda function, you are dropping the "i" column from consideration so pandas no longer has this column to sort by.

In example 4), you are concatenating the entire df including column 'i' so pandas "knows" to sort by that column.
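One way to probe this guess is to rerun the loop from example 4 while dropping column "i" from every piece before concatenating. In the sketch below (the names `pieces` and `ret4_no_i` are illustrative), the result still comes out in sorted key order. That suggests the ordering in example 4 comes from the order in which the groupby iterator yields the groups, rather than from column "i" being present in the data.

```python
import numpy as np
import pandas as pd

cats = list("bcabca")
df = pd.DataFrame({"i": cats, "ii": np.arange(0, 10 * len(cats), 10)})
g = df.groupby("i", sort=True)

# Same loop as in example 4, but each piece keeps only column "ii".
pieces = {key: group[["ii"]] + 1 for key, group in g}
ret4_no_i = pd.concat(pieces, axis=0)

print(list(pieces))                                  # ['a', 'b', 'c']
print(ret4_no_i.index.get_level_values(0).tolist())  # ['a', 'a', 'b', 'b', 'c', 'c']
```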

Derek O
  • Thanks for the thought. The returned df from approach 3 does have a column `i`. For me it is still unclear why `GroupBy.apply()` would not use the same grouping as the iterator produces in `for key, dfg in g: ...`. – normanius Jan 25 '22 at 04:23
  • Yeah I agree that this idea doesn't explain why example 3 doesn't work – Derek O Jan 25 '22 at 05:21