13

I've noticed that there are several uses of pd.DataFrame.groupby followed by an apply implicitly assuming that groupby is stable - that is, if a and b are instances of the same group, and pre-grouping, a appeared before b, then a will appear pre b following the grouping as well.

I think there are several answers clearly implicitly using this, but, to be concrete, here is one using groupby+cumsum.

Is there anything actually promising this behavior? The documentation only states:

Group series using mapper (dict or key function, apply given function to group, return result as series) or by a series of columns.

Also, pandas having indices, the functionality could be theoretically be achieved also without this guarantee (albeit in a more cumbersome way).

Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
  • Sorry are you asking if 2 rows say `a` and `b` have same value are they guaranteed to be in the same order after grouping? I'm sure that I've seen whilst stepping through the code that they perform stable-sorts I'd have to find this again – EdChum Sep 07 '16 at 15:26
  • @EdChum Yes, if I understand you correctly. If rows *a* and *b* are equivalent as far as the grouping criteria (they will end up in the same group), are they *guaranteed* to retain their order post grouping. I've always seen it in practice, but it's a bit worrying that the docs don't seem to guarantee this. – Ami Tavory Sep 07 '16 at 15:28
  • I've always seen this behaviour and never seen any other kind of behaviour, the fact the docs don't specify or guarantee this doesn't perturb me, but I've seen whilst stepping through the code lots of comments and references to stable-sorts performed and this makes logical sense to me because the alternative would be to just make functions like `transform` a pain to coalesce back to the orig df index if the grouping decided to change the original order – EdChum Sep 07 '16 at 15:31
  • @EdChum Many thanks. – Ami Tavory Sep 07 '16 at 15:32
  • If the order wasn't preserved so long as the orig index was kept then it would still align correctly but it would just be a pain to align if the returned series didn't have increasing order as it would require sorting prior to assignment back to the orig df – EdChum Sep 07 '16 at 15:34
  • @EdChum Many thanks for all your illuminating comments. I believe my last paragraph is a (probably inferior) rephrasing of your last one. – Ami Tavory Sep 07 '16 at 15:35
  • I think this is what I was referring to: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L291 and this: https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L4356 – EdChum Sep 07 '16 at 15:43
  • @EdChum Many thanks! – Ami Tavory Sep 07 '16 at 15:45

2 Answers2

8

Although the docs don't state this internally, it uses stable sort when generating the groups.

See:

As I mentioned in the comments, this is important if you consider transform which will return a Series with it's index aligned to the original df. If the sorting didn't preserve the order, then this would make alignment perform additional work as it would need to sort the Series prior to assigning. In fact, this is mentioned in the comments:

_algos.groupsort_indexer implements counting sort and it is at least O(ngroups), where

ngroups = prod(shape)

shape = map(len, keys)

That is, linear in the number of combinations (cartesian product) of unique values of groupby keys. This can be huge when doing multi-key groupby. np.argsort(kind='mergesort') is O(count x log(count)) where count is the length of the data-frame; Both algorithms are stable sort and that is necessary for correctness of groupby operations.

e.g. consider: df.groupby(key)[col].transform('first')

Amelia N Chu
  • 323
  • 1
  • 5
  • 11
EdChum
  • 376,765
  • 198
  • 813
  • 562
1

Yes; the description of the sort parameter of DataFrame.groupby now promises that groupby (with or without key sorting) "preserves the order of rows within each group":

sort : bool, default True

Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

teichert
  • 3,963
  • 1
  • 31
  • 37