5

While debugging a Python programme, I recently discovered that the Python itertools#groupby() function requires the input collection to be sorted, because it only groups identical elements that occur in a sequence:

Generally, the iterable needs to already be sorted on the same key function.

The operation of groupby() is similar to the uniq filter in Unix

In both cases, uniq and Python's groupby(), I wonder what a use case might be for applying these without sorting.

Clearly, sorting can be expensive and should be avoided whenever possible. However, if sorting is apparently inevitable in practice, then why did the Python developers decide to not make it the default in groupby()? This seems to cause a lot of confusion among the users of the function.

I noted that this design decision does not seem to be universal. Languages like Scala seem to implicitly sort collections in their groupBy() functions.

My question is hence: what are the use cases that led to the design decision about not implicitly sorting in uniq and Python's groupby()?

Community
  • 1
  • 1
Carsten
  • 1,912
  • 1
  • 28
  • 55
  • 2
    One thing to note is that you can’t sort infinite inputs (or otherwise generically produce entire groups online), which both `uniq` and `itertools.groupby` accept. – Ry- Jun 08 '19 at 14:04
  • 2
    As well as what Ry has said... you'd also lose the ability to group consecutive data... eg changing: `[1, 1, 1, 0, 0, 1, 1, 0, 0, 0]` to something like `[(1, 3), (0, 2), (1, 2), (0, 3)]`... If `groupby` was to explicitly order that data... I wouldn't be able to get the result I'm after... while if it makes no such assumption, it's up to the developer to sort it/do whatever, to make sure the grouping occurs... – Jon Clements Jun 08 '19 at 14:05
  • 1
    Because sorting might be done elsewhere in your code, or maybe your iterable is already sorted by default/definition like `list(range(10))`. In these cases, `groupBy()` doesn't need to waste time sorting something that's already sorted, or checking to see if it's sorted or not. – jfaccioni Jun 08 '19 at 14:07
  • Adding to the remark about Scala, that does sort before grouping: Scala also does not have a `groupBy()` function on the `Iterator`, so that infinite collections are excluded there: https://stackoverflow.com/questions/29427059/how-to-groupby-an-iterator-without-converting-it-to-list-in-scala – Carsten Jun 11 '19 at 13:11

1 Answers1

0

You can use a comprehension with internal side effects to group on an iterator without sorting (and without using libraries) like this:

from random import randrange
source   = ( randrange(20) for _ in range(20) )
getKey   = lambda n: n % 5
grouped, = ([d][any(d.setdefault(getKey(v),[]).append(v) for v in source)] for d in [dict()])

print(grouped)
# {2: [17, 2, 17, 17, 17], 1: [1, 11, 1, 16, 1], 4: [19, 19, 14, 19, 9], 3: [3, 3], 0: [0, 10, 5]}
Alain T.
  • 40,517
  • 4
  • 31
  • 51