
While reading the Python documentation I came across the itertools.groupby() function. It was not very straightforward, so I decided to look up some info here on Stack Overflow. I found something in How do I use Python's itertools.groupby()?.

There seems to be little info about it here and in the documentation, so I decided to post my observations for comments.

Thanks

chidimo
  • Did you check the [`groupby()` documentation](https://docs.python.org/2/library/itertools.html#itertools.groupby)? Which part of it was not straightforward? – Moinuddin Quadri Dec 31 '16 at 20:29
  • @MoinuddinQuadri The first sentence of the OP's question states that they read the Python documentation. – SudoKid Dec 31 '16 at 20:34
  • you ask a question for which you have an elaborate answer prepared? really? why not have all that in the question and leave the answers section for discussion? – hiro protagonist Dec 31 '16 at 20:34
  • @hiroprotagonist [It's perfectly acceptable to ask a question for the sole purpose of answering it](https://stackoverflow.blog/2011/07/its-ok-to-ask-and-answer-your-own-questions/). I've done it myself. "why not have all that in the question" Because an answer isn't part of a question. An answer is an answer. – Tagc Dec 31 '16 at 20:46
  • @EmettSpeer My actual question is *"Which part was not straight forward in this?"*. I mentioned the link to the doc just to make sure that OP checked the official Python document, and not of any tutorial – Moinuddin Quadri Dec 31 '16 at 20:46
  • @MoinuddinQuadri That is very valid and I was not putting down your question. I was only pointing out that the first part of your question was already answered. – SudoKid Dec 31 '16 at 20:52
  • I would answer you Moinuddin. I am relatively new to Python and a lot of times I've often gotten frustrated looking for solutions. I read the docs. groupby() was the most complicated of all. I haven't really wrapped my head around the whole class thing. And the docs' examples were not so clear. I didn't think I'd sit around and wait for someone to ask the question before I answer. Hope my being proactive offends you not. And I stated clearly that I was only posting observations for comments. I might have missed a thing or added two. – chidimo Dec 31 '16 at 21:05

2 Answers


To start with, you may read the documentation here.

I will place what I consider to be the most important point first. I hope the reason will become clear after the examples.

ALWAYS SORT ITEMS WITH THE SAME KEY TO BE USED FOR GROUPING SO AS TO AVOID UNEXPECTED RESULTS

itertools.groupby(iterable, key=None) takes an iterable and groups its consecutive items based on a key. The key is a function applied to each individual item (if it is None, the item itself is the key), and its result is used as the heading for each group; consecutive items that end up with the same key value land in the same group.

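The return value is an iterator of (key, group) pairs, where each group is itself an iterator over the items in that group. It is not a dictionary, but the pairs can be collected into one of the form {key: value}, as the examples below do.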
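To make the shape of the return value concrete, here is a minimal sketch. The `words` list and the `pairs` name are made up for illustration and are not part of the original examples.

```python
from itertools import groupby

# Hypothetical sample data, chosen only for illustration.
words = ['apple', 'avocado', 'banana', 'blueberry', 'cherry']

# Each iteration yields a (key, group) pair. The group is a lazy iterator,
# so it is turned into a list here to make its contents visible.
pairs = [(key, list(group)) for key, group in groupby(words, key=lambda w: w[0])]
print(pairs)
# [('a', ['apple', 'avocado']), ('b', ['banana', 'blueberry']), ('c', ['cherry'])]
```

The input here happens to be ordered by first letter already, which is why every key appears exactly once.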

Example 1

from itertools import groupby

# Note that the tuple counts as one item in this list. I did not
# specify a key, so each item in the list is its own key.
c = groupby(['goat', 'dog', 'cow', 1, 1, 2, 3, 11, 10, ('persons', 'man', 'woman')])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

results in

{1: [1, 1],
 'goat': ['goat'],
 3: [3],
 'cow': ['cow'],
 ('persons', 'man', 'woman'): [('persons', 'man', 'woman')],
 10: [10],
 11: [11],
 2: [2],
 'dog': ['dog']}

Example 2

# Notice that 'mulato', 'cow' and 'cat' don't show up. Only the last group
# with a given key survives, because each later group replaces the earlier
# one in the dict: ['camel'] wipes out ['cow', 'cat'], and
# ['mongoose', 'malloo'] wipes out ['mulato'].

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
c = groupby(list_things, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

results in

{'c': ['camel'],
 'd': ['dog', 'donkey'],
 'g': ['goat'],
 'm': ['mongoose', 'malloo'],
 'persons': [('persons', 'man', 'woman')],
 'w': ['wombat']}
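
To see where the missing items went, it may help to collect the pairs in a list instead of a dict. This sketch reuses list_things from Example 2; the name `runs` is mine.

```python
from itertools import groupby

list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']

# A list keeps every run that groupby produces, including repeated keys.
runs = [(k, list(v)) for k, v in groupby(list_things, key=lambda x: x[0])]
for k, v in runs:
    print(k, v)
# g ['goat']
# d ['dog', 'donkey']
# m ['mulato']
# c ['cow', 'cat']
# persons [('persons', 'man', 'woman')]
# w ['wombat']
# m ['mongoose', 'malloo']
# c ['camel']
```

The keys 'm' and 'c' each appear twice, so storing the groups in a dict lets the later run overwrite the earlier one.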

Now for the sorted version

# Now observe the sorted version, where the data is first sorted on the
# same key that is used for grouping.
list_things = ['goat', 'dog', 'donkey', 'mulato', 'cow', 'cat', ('persons', 'man', 'woman'),
               'wombat', 'mongoose', 'malloo', 'camel']
sorted_list = sorted(list_things, key=lambda x: x[0])
print(sorted_list)
print()
c = groupby(sorted_list, key=lambda x: x[0])
dic = {}
for k, v in c:
    dic[k] = list(v)
dic

results in

['cow', 'cat', 'camel', 'dog', 'donkey', 'goat', 'mulato', 'mongoose', 'malloo', ('persons', 'man', 'woman'), 'wombat']
{'c': ['cow', 'cat', 'camel'],
 'd': ['dog', 'donkey'],
 'g': ['goat'],
 'm': ['mulato', 'mongoose', 'malloo'],
 'persons': [('persons', 'man', 'woman')],
 'w': ['wombat']}

Example 3

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "harley"), \
          ("vehicle", "speed boat"), ("vehicle", "school bus")]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic

results in

{'animal': [('animal', 'bear'), ('animal', 'duck')],
 'plant': [('plant', 'cactus')],
 'vehicle': [('vehicle', 'harley'),
  ('vehicle', 'speed boat'),
  ('vehicle', 'school bus')]}

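Now the same data with the input out of order (harley comes before cactus). I also changed the tuples to lists here. Because the data is sorted before grouping, the results are the same either way.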

things = [["animal", "bear"], ["animal", "duck"], ["vehicle", "harley"], ["plant", "cactus"], \
          ["vehicle", "speed boat"], ["vehicle", "school bus"]]
dic = {}
f = lambda x: x[0]
for key, group in groupby(sorted(things, key=f), f):
    dic[key] = list(group)
dic

results in

{'animal': [['animal', 'bear'], ['animal', 'duck']],
 'plant': [['plant', 'cactus']],
 'vehicle': [['vehicle', 'harley'],
  ['vehicle', 'speed boat'],
  ['vehicle', 'school bus']]}
chidimo
  • "`itertools.groupby(iterable, key=None or some func)` takes a list of iterables" Does it take a list of iterables, or just an iterable? A list is an iterable. – Tagc Dec 31 '16 at 20:43
  • The docs don't say explicitly. But from the examples I posted you can see that I used both a list and nested list. So it can take an "iterable" (Example 1) as well as a "list of iterables" (Example 2). You may even pass in a single string and you'd still be in business – chidimo Dec 31 '16 at 21:19

As always, the documentation of the function should be the first place to check. However, itertools.groupby is certainly one of the trickiest itertools because it has some possible pitfalls:

  • It only groups the items if their key-result is the same for successive items:

    from itertools import groupby
    
    for key, group in groupby([1,1,1,1,5,1,1,1,1,4]):
        print(key, list(group))
    # 1 [1, 1, 1, 1]
    # 5 [5]
    # 1 [1, 1, 1, 1]
    # 4 [4]
    

    One could sort the input with sorted first if one wants an overall groupby.

  • It yields (key, group) pairs, and the group is an iterator (so one needs to iterate over it to see its items!). I explicitly needed to cast the groups to a list in the previous example.

  • The group from a previous step is discarded as soon as one advances the groupby-iterator:

    it = groupby([1,1,1,1,5,1,1,1,1,4])
    key1, group1 = next(it)
    key2, group2 = next(it)
    print(key1, list(group1))
    # 1 []
    

    Even if group1 isn't empty!
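
One way around this last pitfall is to materialize each group before asking for the next one. A small sketch with the same input (the names `data` and `items1` are mine):

```python
from itertools import groupby

data = [1, 1, 1, 1, 5, 1, 1, 1, 1, 4]

it = groupby(data)
key1, group1 = next(it)
items1 = list(group1)        # consume the group before advancing
key2, group2 = next(it)
print(key1, items1)
# 1 [1, 1, 1, 1]
print(key2, list(group2))
# 5 [5]
```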

As already mentioned, one can use sorted to do an overall groupby operation, but that's extremely inefficient (and throws away the memory-efficiency if you want to use groupby on generators). There are better alternatives available if you can't guarantee that the input is sorted (which also don't require the O(n log(n)) sorting time overhead).
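
One possible alternative, sketched here under my own assumptions rather than taken from the answer, is to collect items into a collections.defaultdict from the standard library. It groups across the whole input in a single pass without sorting. The helper name `group_all` and the `animals` list are hypothetical.

```python
from collections import defaultdict

def group_all(iterable, key=None):
    """Group every item by key, regardless of position (hypothetical helper)."""
    groups = defaultdict(list)
    for item in iterable:
        k = item if key is None else key(item)
        groups[k].append(item)
    return dict(groups)

# Unsorted sample data, made up for illustration.
animals = ['goat', 'dog', 'cow', 'donkey', 'cat']
print(group_all(animals, key=lambda x: x[0]))
# {'g': ['goat'], 'd': ['dog', 'donkey'], 'c': ['cow', 'cat']}
```

Unlike groupby, this holds every group in memory at once and needs hashable keys, so it is a trade-off and not a drop-in replacement.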

However, groupby is great for checking local properties. There are two recipes in the itertools-recipes section:

from itertools import groupby

def all_equal(iterable):
    "Returns True if all the elements are equal to each other"
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

and:

from itertools import groupby
from operator import itemgetter

def unique_justseen(iterable, key=None):
    "List unique elements, preserving order. Remember only the element just seen."
    # unique_justseen('AAAABBBCCDAABBB') --> A B C D A B
    # unique_justseen('ABBCcAD', str.lower) --> A B C A D
    return map(next, map(itemgetter(1), groupby(iterable, key)))
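
A quick usage sketch for the two recipes, repeated here with their imports so the block runs on its own. It assumes Python 3, where map is lazy and returns an iterator.

```python
from itertools import groupby
from operator import itemgetter

def all_equal(iterable):
    # True when groupby produces at most one group.
    g = groupby(iterable)
    return next(g, True) and not next(g, False)

def unique_justseen(iterable, key=None):
    # Take the first item of each consecutive group.
    return map(next, map(itemgetter(1), groupby(iterable, key)))

print(all_equal([3, 3, 3]))
# True
print(all_equal([3, 3, 4]))
# False
print(list(unique_justseen('AAAABBBCCDAABBB')))
# ['A', 'B', 'C', 'D', 'A', 'B']
```

Both recipes rely on groupby only merging neighbouring items, which is exactly the behaviour that surprises people when they expect an overall grouping.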
MSeifert
  • Thanks. I'll definitely take note in case I need some alternatives sometime. For now I'm reading the docs section by section so as not to jumble up everything. And a happy new year to you – chidimo Dec 31 '16 at 21:24
  • Great info here. The documentation for `collections.defaultdict` has a very straightforward example about how to group values: https://docs.python.org/3/library/collections.html#defaultdict-examples – Mass Dot Net Jun 09 '20 at 16:38