678

I haven't been able to find an understandable explanation of how to actually use Python's itertools.groupby() function. What I'm trying to do is this:

  • Take a list - in this case, the children of an objectified lxml element
  • Divide it into groups based on some criteria
  • Then later iterate over each of these groups separately.

I've reviewed the documentation, but I've had trouble applying it to anything beyond a simple list of numbers.

So, how do I use itertools.groupby()? Is there another technique I should be using? Pointers to good "prerequisite" reading would also be appreciated.

Seanny123
James Sulak

15 Answers

848

IMPORTANT NOTE: groupby() only groups consecutive items, so you may have to sort your data by the grouping key first.
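
A minimal sketch of what that note means in practice, assuming some made-up data: unsorted input can yield the same key more than once, and sorting by the same key function first avoids that. The names `pets` and `species` are illustrative and not part of the original answer.

```python
from itertools import groupby

# Hypothetical sample data: (species, name) pairs, deliberately unsorted.
pets = [("dog", "Rex"), ("cat", "Tom"), ("dog", "Fido")]


def species(pet):
    """Key function: group by the first tuple element."""
    return pet[0]


# Without sorting, "dog" shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(pets, key=species)]

# Sorting by the same key function makes equal keys adjacent first.
sorted_groups = [
    (key, [name for _, name in group])
    for key, group in groupby(sorted(pets, key=species), key=species)
]

print(unsorted_keys)   # ['dog', 'cat', 'dog']
print(sorted_groups)   # [('cat', ['Tom']), ('dog', ['Rex', 'Fido'])]
```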


The part I didn't get is that in the example construction

groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
   groups.append(list(g))    # Store group iterator as a list
   uniquekeys.append(k)

k is the current grouping key, and g is an iterator that you can use to iterate over the group defined by that grouping key. In other words, the groupby iterator itself returns iterators.

Here's an example of that, using clearer variable names:

from itertools import groupby

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print("")
    

This will give you the output:

A bear is a animal.
A duck is a animal.

A cactus is a plant.

A speed boat is a vehicle.
A school bus is a vehicle.

In this example, things is a list of tuples where the first item in each tuple is the group the second item belongs to.

The groupby() function takes two arguments: (1) the data to group and (2) the function to group it with.

Here, lambda x: x[0] tells groupby() to use the first item in each tuple as the grouping key.

In the above for statement, groupby yields three (key, group iterator) pairs, one for each unique key. You can use the returned iterator to iterate over each individual item in that group.

Here's a slightly different example with the same data, using a list comprehension:

for key, group in groupby(things, lambda x: x[0]):
    listOfThings = " and ".join([thing[1] for thing in group])
    print(key + "s: " + listOfThings + ".")

This will give you the output:

animals: bear and duck.
plants: cactus.
vehicles: speed boat and school bus.
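
If you need the groups after the loop has finished, one option is to copy each group into a list as you go. This is only a sketch that reuses the `things` data from above; `groups_by_kind` is an illustrative name, and the `sorted()` call is a safeguard in case the data is not already ordered by key.

```python
from itertools import groupby

things = [("animal", "bear"), ("animal", "duck"), ("plant", "cactus"),
          ("vehicle", "speed boat"), ("vehicle", "school bus")]


def kind(thing):
    """Key function: the first tuple element is the group name."""
    return thing[0]


groups_by_kind = {
    key: [name for _, name in group]   # copy the one-shot iterator into a list
    for key, group in groupby(sorted(things, key=kind), key=kind)
}

print(groups_by_kind["vehicle"])   # ['speed boat', 'school bus']
print(list(groups_by_kind))        # ['animal', 'plant', 'vehicle']
```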

Seanny123
James Sulak
  • 3
    Is there a way to specify the groups beforehand and then not require sorting? – John Salvatier May 10 '11 at 19:39
  • 3
    itertools usually clicks for me, but I also had a 'block' for this one. I appreciated your examples-- far clearer than docs. I think itertools tend to either click or not, and are much easier to grasp if you happen to have hit similar problems. Haven't needed this one in the wild yet. – Profane Aug 21 '11 at 20:30
  • 4
    @Julian python docs seem great for most stuff but when it comes to iterators, generators, and cherrypy the docs mostly mystify me. Django's docs are doubly baffling. – Marc Maxmeister Oct 01 '12 at 18:19
  • This is a great example as well http://kentsjohnson.com/blog/arch_m1_2005_12.html. The section on groupby() – snakesNbronies Nov 26 '12 at 19:26
  • 1
    You can use a single list comprehension to get lists of groups: ```item_groups = [group[1] for group in itertools.groupby(items, lambda item: item.property)]``` or if they are sorted: ```group_dict = dict([(key, group) for key, group in itertools.groupby(items, lambda item: item.property)])``` – Danny Staple Apr 10 '14 at 16:28
  • 14
    +1 for the sorting -- I didn't understand what you meant until I grouped my data. – Cody Apr 24 '14 at 02:25
  • 1
    It should be noted that storing the group iterator as list as shown in the documentation is important! Otherwise if you try to iterate using group twice it won't work the second time. – asmaier Mar 18 '15 at 16:12
  • 1
    I have a data set with ~200,000 items in it. There are 90 unique groups. Using this method produces 7155 different groups...Any thoughts? – David Crook Aug 15 '16 at 21:14
  • 9
    @DavidCrook very late to the party but might help someone. It's probably because your array is not sorted; try `groupby(sorted(my_collection, key=lambda x: x[0]), lambda x: x[0])` under the assumption that `my_collection = [("animal", "bear"), ("plant", "cactus"), ("animal", "duck")]` and you want to group by `animal or plant` – redacted Dec 14 '17 at 15:12
  • The pythonic way to create a dict. group_dict = {key: group for key, group in itertools.groupby(items, lambda item: item.property)} – Mickey Perlstein Aug 12 '19 at 19:22
  • For list of tuples, how can you modify `key` argument of `groupby`, to group by based on more elements? – Kots Feb 24 '21 at 16:26
  • I wish I could +10 this for the sorting tip. had me going – D2TheC May 19 '21 at 10:10
  • I don't know who design this stupid API, `You have to sort your data first.` before groupby. Why not let me design and make a CPU by myself before use python? – huang Jul 11 '21 at 11:53
  • I've also struggled a bit with `groupby` at first but got some more insights by working with the tool over quite some time. I wrote them out in an article in case it helps anyone: https://medium.com/codex/python-groupby-tricks-234004132c14 – brvh May 04 '22 at 18:59
  • nice example, speed boat is not a vehicle – Christos Feb 17 '23 at 22:15
155

itertools.groupby is a tool for grouping items.

From the docs, we glean further what it might do:

# [k for k, g in groupby('AAAABBBCCDAABBB')] --> A B C D A B

# [list(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D

groupby objects yield key-group pairs where the group is a lazy, single-use iterator.
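
A small sketch of what "single-use" means here: reading a group a second time yields nothing, so copy it into a list if you need it more than once. The variable names are illustrative.

```python
from itertools import groupby

results = []
for key, group in groupby("AAAB"):
    first_pass = list(group)    # reads the whole group
    second_pass = list(group)   # the iterator is already exhausted
    results.append((key, first_pass, second_pass))

print(results)
# [('A', ['A', 'A', 'A'], []), ('B', ['B'], [])]
```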

Features

  • A. Group consecutive items together
  • B. Group all occurrences of an item, given a sorted iterable
  • C. Specify how to group items with a key function *

Comparisons

# Define a printer for comparing outputs
>>> import itertools as it
>>> def print_groupby(iterable, keyfunc=None):
...    for k, g in it.groupby(iterable, keyfunc):
...        print("key: '{}'--> group: {}".format(k, list(g)))
# Feature A: group consecutive occurrences
>>> print_groupby("BCAACACAADBBB")
key: 'B'--> group: ['B']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A']
key: 'C'--> group: ['C']
key: 'A'--> group: ['A', 'A']
key: 'D'--> group: ['D']
key: 'B'--> group: ['B', 'B', 'B']

# Feature B: group all occurrences
>>> print_groupby(sorted("BCAACACAADBBB"))
key: 'A'--> group: ['A', 'A', 'A', 'A', 'A']
key: 'B'--> group: ['B', 'B', 'B', 'B']
key: 'C'--> group: ['C', 'C', 'C']
key: 'D'--> group: ['D']

# Feature C: group by a key function
>>> # islower = lambda s: s.islower()                      # equivalent
>>> def islower(s):
...     """Return True if a string is lowercase, else False."""   
...     return s.islower()
>>> print_groupby(sorted("bCAaCacAADBbB"), keyfunc=islower)
key: 'False'--> group: ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'D']
key: 'True'--> group: ['a', 'a', 'b', 'b', 'c']

Uses

Note: Several of the latter examples derive from Víctor Terrón's PyCon talk (in Spanish), "Kung Fu at Dawn with Itertools". See also the groupby source code written in C.

* A function where all items are passed through and compared, influencing the result. Other objects with key functions include sorted(), max() and min().
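
To make the footnote concrete, here is a hedged sketch of the same key-function idea applied to sorted(), max(), min() and groupby(). The word list is made up for illustration.

```python
from itertools import groupby

words = ["banana", "Cherry", "apple", "fig"]

# A key function is applied to every item; the results drive the comparison.
alphabetical = sorted(words, key=str.lower)
longest = max(words, key=len)    # the first of the longest words wins a tie
shortest = min(words, key=len)

# groupby uses the key the same way, but only across consecutive items.
by_length = [(n, list(g)) for n, g in groupby(sorted(words, key=len), key=len)]

print(alphabetical)   # ['apple', 'banana', 'Cherry', 'fig']
print(longest)        # banana
print(shortest)       # fig
print(by_length)      # [(3, ['fig']), (5, ['apple']), (6, ['banana', 'Cherry'])]
```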


Response

# OP: Yes, you can use `groupby`, e.g. 
[do_something(list(g)) for _, g in groupby(lxml_elements, criteria_func)]
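
A runnable sketch of that one-liner's idea, under stated assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml objectify, the XML document is made up, and `criteria_func` here simply groups children by tag name. Your own criterion would go in its place.

```python
from itertools import groupby
import xml.etree.ElementTree as ET

# Hypothetical document; the tags and values are invented for illustration.
root = ET.fromstring("<root><a>1</a><b>2</b><a>3</a><c>4</c><b>5</b></root>")


def criteria_func(element):
    """Example criterion (an assumption): group children by tag name."""
    return element.tag


# Iterating an Element yields its children; sort them by the grouping key first.
children = sorted(root, key=criteria_func)

groups = {
    tag: [child.text for child in group]
    for tag, group in groupby(children, key=criteria_func)
}

for tag, texts in groups.items():
    print(tag, texts)
# a ['1', '3']
# b ['2', '5']
# c ['4']
```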
pylang
  • 2
    Technically, the docs should probably say `[''.join(g) for k, g in groupby('AAAABBBCCD')] --> AAAA BBB CC D`. – Mateen Ulhaq Oct 24 '18 at 22:55
  • 4
    Yes. Most of the itertools docstrings are "abridged" in this way. Since all of the itertools are iterators, they must be cast to a builtin (`list()`, `tuple()`) or consumed in a loop/comprehension to display the contents. These are redundancies the author likely excluded to conserve space. – pylang Oct 25 '18 at 00:13
74

The example in the Python docs is quite straightforward:

groups = []
uniquekeys = []
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)

So in your case, data is a list of nodes, keyfunc is where the logic of your criteria function goes and then groupby() groups the data.

You must be careful to sort the data by the same criteria before you call groupby, or equal keys that are not adjacent will end up in separate groups. groupby actually just iterates through the data, and whenever the key changes it starts a new group.
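
The "new group whenever the key changes" behaviour can be sketched in a few lines of plain Python. This is an eager, simplified stand-in written for illustration (`simple_groupby` is a hypothetical name); the real itertools.groupby is lazy and yields iterators rather than lists.

```python
def simple_groupby(iterable, keyfunc=lambda x: x):
    """Eager, simplified stand-in for itertools.groupby (illustration only)."""
    groups = []
    sentinel = object()          # marks "no key seen yet"
    current_key = sentinel
    for item in iterable:
        key = keyfunc(item)
        if current_key is sentinel or key != current_key:
            groups.append((key, []))   # key changed: open a new group
            current_key = key
        groups[-1][1].append(item)     # add the item to the open group
    return groups


print(simple_groupby("AABBA"))
# [('A', ['A', 'A']), ('B', ['B', 'B']), ('A', ['A'])]
```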

funnydman
Seb
  • 74
    So you read `keyfunc` and were like "yeah, I know exactly what that is because this documentation is quite straightforward."? Incredible! – Jarad Apr 07 '17 at 19:22
  • 15
    I believe most people know already about this "straightforward" but useless example, since it doesn't say what kind of 'data' and 'keyfunc' to use!! But I guess you don't know either, otherwise you would help people by clarifying it and not just copy-pasting it. Or do you? – Apostolos Mar 28 '18 at 19:14
  • 4
    I will say, that while just pasting in the docs the question already referenced is in no way a helpful answer, the additional statement below that is a nice reminder. The data must first be sorted by the keyfunc. So if the user has a list of classes and she wishes to group by obj.attr_a, `grouping_target = sorted(obj_list, key=lambda o: o.attr_a)` and then a `groups = itertools.groupby(grouping_target, key=lambda o: o.attr_a)`. Otherwise, as noted, it won't work and you'll see duplication of your groupby keys. – Matthew Jul 01 '20 at 01:23
51

A neato trick with groupby is to do run-length encoding in one line:

[(c,len(list(cgen))) for c,cgen in groupby(some_string)]

will give you a list of 2-tuples where the first element is the char and the 2nd is the number of repetitions.

Edit: Note that this is what separates itertools.groupby from the SQL GROUP BY semantics: itertools doesn't (and in general can't) sort the iterator in advance, so groups with the same "key" aren't merged.
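
A self-contained sketch of the run-length encoding trick, with a matching decoder to show that the unmerged, repeated keys are exactly what makes the round trip possible. `rle_encode` and `rle_decode` are illustrative names, not library functions.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each run of identical characters into a (char, count) pair."""
    return [(char, sum(1 for _ in run)) for char, run in groupby(text)]


def rle_decode(pairs):
    """Rebuild the original string from (char, count) pairs."""
    return "".join(char * count for char, count in pairs)


encoded = rle_encode("AAAABBBCCDAA")
print(encoded)               # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
print(rle_decode(encoded))   # AAAABBBCCDAA
```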

nimish
34

Another example:

import itertools

for key, igroup in itertools.groupby(range(12), lambda x: x // 5):
    print(key, list(igroup))

results in

0 [0, 1, 2, 3, 4]
1 [5, 6, 7, 8, 9]
2 [10, 11]

Note that igroup is an iterator (a sub-iterator as the documentation calls it).

This is useful for chunking a generator:

def chunker(items, chunk_size):
    '''Group items in chunks of chunk_size'''
    for _key, group in itertools.groupby(enumerate(items), lambda x: x[0] // chunk_size):
        yield (g[1] for g in group)

with open('file.txt') as fobj:
    for chunk in chunker(fobj, 10):  # 10 lines per chunk (arbitrary example size)
        process(chunk)
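
For fixed-size chunks, an alternative sketch that does not use groupby at all is to pull chunk_size items at a time with itertools.islice. `chunked` is a hypothetical helper name; unlike chunker above, it materializes each chunk as a list, so a chunk stays usable after the next one has been requested.

```python
from itertools import islice


def chunked(iterable, chunk_size):
    """Yield lists of up to chunk_size items until the input runs out."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:      # nothing left to read
            return
        yield chunk


print(list(chunked(range(7), 3)))   # [[0, 1, 2], [3, 4, 5], [6]]
```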

Another example of groupby - when the keys are not sorted. In the following example, items in xx are grouped by values in yy. In this case, one set of zeros is output first, followed by a set of ones, followed again by a set of zeros.

xx = range(10)
yy = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
for group in itertools.groupby(iter(xx), lambda x: yy[x]):
    print(group[0], list(group[1]))

Produces:

0 [0, 1, 2]
1 [3, 4, 5]
0 [6, 7, 8, 9]
The Singularity
user650654
  • That's interesting, but wouldn't itertools.islice be better for chunking an iterable? It returns an object that iterates like a generator, but it uses C code. – trojjer Dec 04 '13 at 10:37
  • @trojjer islice would be better IF the groups are consistent sized. – woodm1979 Dec 17 '13 at 17:48
24

WARNING:

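The syntax list(groupby(...)) won't work the way that you intend. Each group is a lazy iterator that shares the underlying iterable with groupby itself, so advancing groupby to the next group (which list() does immediately) leaves the earlier groups with nothing to read. As a result, using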

for x in list(groupby(range(10))):
    print(list(x[1]))

will produce the following (on recent CPython versions the last line is reportedly [] as well):

[]
[]
[]
[]
[]
[]
[]
[]
[]
[9]

Instead of list(groupby(...)), try [(k, list(g)) for k,g in groupby(...)], or if you use that syntax often,

def groupbylist(*args, **kwargs):
    return [(k, list(g)) for k, g in groupby(*args, **kwargs)]

and get access to the groupby functionality while avoiding those pesky (for small data) iterators altogether.
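
A short usage sketch for the helper above. The key function and input range are made up for illustration; the point is that every group is a plain list that can be read as often as needed.

```python
from itertools import groupby


def groupbylist(*args, **kwargs):
    """Like groupby, but with every group already copied into a list."""
    return [(k, list(g)) for k, g in groupby(*args, **kwargs)]


pairs = groupbylist(range(10), key=lambda n: n // 4)

print(pairs)         # [(0, [0, 1, 2, 3]), (1, [4, 5, 6, 7]), (2, [8, 9])]
print(pairs[0][1])   # [0, 1, 2, 3]
print(pairs[0][1])   # [0, 1, 2, 3]  (reading it again still works)
```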

Nate Anderson
RussellStewart
  • 3
    Many of the answers refer to the stumbling block that you must sort before groupby to get expected results. I just encountered this answer, which explains the strange behavior I haven't seen before. I haven't seen before because only now was I trying to list(groupby(range(10)) as @singular says. Before that I'd always used the "recommended" approach of "manually" iterating through the groupby objects rather than letting the list() constructor "automatically" do it. – Nate Anderson Sep 11 '14 at 05:13
12

I would like to give another example where groupby without sorting does not give the expected result. It is adapted from the example by James Sulak.

from itertools import groupby

things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]

for key, group in groupby(things, lambda x: x[0]):
    for thing in group:
        print("A %s is a %s." % (thing[1], key))
    print(" ")

output is

A bear is a vehicle.

A duck is a animal.
A cactus is a animal.

A speed boat is a vehicle.
A school bus is a vehicle.

There are two groups with the key vehicle, whereas one might expect only one group.

nutship
kiriloff
  • 6
    You have to sort the data first, using as key the function you are grouping by. This is mentioned in two post above, but is not highlighted. – mbatchkarov Jun 25 '13 at 15:19
  • I was doing a dict comprehension to preserve the sub-iterators by key, until I realised that this was as simple as dict(groupby(iterator, key)). Sweet. – trojjer Dec 04 '13 at 12:00
  • On second thoughts and after experimentation, the dict call wrapped around the groupby will exhaust the group sub-iterators. Damn. – trojjer Dec 04 '13 at 13:57
  • 1
    What is the point of this answer? How is it building on the [original answer](https://stackoverflow.com/a/7286/6862601)? – codeforester Apr 12 '20 at 01:39
  • This answer simply repeats https://stackoverflow.com/users/207/james-sulak answer, and does not provide any additional information that was not previously mentioned in other answers. – deepGrave Dec 16 '22 at 15:40
9

@CaptSolo, I tried your example, but it didn't work.

from itertools import groupby 
[(c,len(list(cs))) for c,cs in groupby('Pedro Manoel')]

Output:

[('P', 1), ('e', 1), ('d', 1), ('r', 1), ('o', 1), (' ', 1), ('M', 1), ('a', 1), ('n', 1), ('o', 1), ('e', 1), ('l', 1)]

As you can see, there are two o's and two e's, but they got into separate groups. That's when I realized you need to sort the list passed to the groupby function. So, the correct usage would be:

name = list('Pedro Manoel')
name.sort()
[(c,len(list(cs))) for c,cs in groupby(name)]

Output:

[(' ', 1), ('M', 1), ('P', 1), ('a', 1), ('d', 1), ('e', 2), ('l', 1), ('n', 1), ('o', 2), ('r', 1)]

Just remember: if the list is not sorted, groupby will not merge equal items that are not adjacent!

Craig S. Anderson
pedromanoel
  • 9
    Actually it works. You might think this behavior as broken, but it's useful in some cases. See answers to this question for an example: http://stackoverflow.com/questions/1553275/how-to-strip-a-list-of-tuple-with-python – Denis Otkidach Oct 15 '09 at 16:29
9

Sorting and groupby

from itertools import groupby

val = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076}, 
       {'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
       {'name': 'Preetam', 'address': 'btm', 'pin': 560076}]


for pin, list_data in groupby(sorted(val, key=lambda k: k['pin']), lambda x: x['pin']):
    print(pin)
    for rec in list_data:
        print(rec)

Output:

560076
{'name': 'satyajit', 'address': 'btm', 'pin': 560076}
{'name': 'Preetam', 'address': 'btm', 'pin': 560076}
560078
{'name': 'Mukul', 'address': 'Silk board', 'pin': 560078}
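
Because the sort key and the grouping key must agree, it can help to define the key function once and pass it to both calls. A sketch using operator.itemgetter on the same data; `by_pin` and `names_by_pin` are illustrative names.

```python
from itertools import groupby
from operator import itemgetter

val = [{'name': 'satyajit', 'address': 'btm', 'pin': 560076},
       {'name': 'Mukul', 'address': 'Silk board', 'pin': 560078},
       {'name': 'Preetam', 'address': 'btm', 'pin': 560076}]

by_pin = itemgetter('pin')   # one key function shared by sorted() and groupby()

names_by_pin = {
    pin: [rec['name'] for rec in records]
    for pin, records in groupby(sorted(val, key=by_pin), key=by_pin)
}

print(names_by_pin)   # {560076: ['satyajit', 'Preetam'], 560078: ['Mukul']}
```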
Aashish Gahlawat
Satyajit Das
8

Sadly I don’t think it’s advisable to use itertools.groupby(). It’s just too hard to use safely, and it’s only a handful of lines to write something that works as expected.

from collections import defaultdict


def my_group_by(iterable, keyfunc):
    """Because itertools.groupby is tricky to use

    The stdlib method requires sorting in advance, and returns iterators not
    lists, and those iterators get consumed as you try to use them, throwing
    everything off if you try to look at something more than once.
    """
    ret = defaultdict(list)
    for k in iterable:
        ret[keyfunc(k)].append(k)
    return dict(ret)

Use it like this:

def first_letter(x):
    return x[0]

my_group_by('four score and seven years ago'.split(), first_letter)

to get

{'f': ['four'], 's': ['score', 'seven'], 'a': ['and', 'ago'], 'y': ['years']}
andrewdotn
  • 1
    Can you please extend on why it's too hard to use safely? – ctholho Jul 22 '21 at 18:35
  • @ctholho It’s explained in the docstring, where it will be easily available if anyone ever looks at the code and wonders why it’s not using the standard library method: “The stdlib method requires sorting in advance, and returns iterators not lists, and those iterators get consumed as you try to use them, throwing everything off if you try to look at something more than once.” – andrewdotn Jul 22 '21 at 22:09
7

How do I use Python's itertools.groupby()?

You can use groupby to group things to iterate over. You give groupby an iterable and an optional key function (any callable), which it applies to each item as it comes out of the iterable. It returns an iterator that yields two-tuples: the result of the key callable, and a sub-iterator over the consecutive items that share that key. From the help:

groupby(iterable[, keyfunc]) -> create an iterator which returns
(key, sub-iterator) grouped by each value of key(value).

Here's an example of groupby using a coroutine to group by a count. The key callable (in this case, coroutine.send) ignores the item it is sent and simply returns the current count, which increases by one after every n items, so each group is a sub-iterator of up to n consecutive elements:

import itertools


def grouper(iterable, n):
    def coroutine(n):
        yield # queue up coroutine
        for i in itertools.count():
            for j in range(n):
                yield i
    groups = coroutine(n)
    next(groups) # queue up coroutine

    for c, objs in itertools.groupby(iterable, groups.send):
        yield c, list(objs)
    # or instead of materializing a list of objs, just:
    # return itertools.groupby(iterable, groups.send)

list(grouper(range(10), 3))

evaluates to

[(0, [0, 1, 2]), (1, [3, 4, 5]), (2, [6, 7, 8]), (3, [9])]
Russia Must Remove Putin
7

This basic implementation helped me understand this function. Hope it helps others as well:

from itertools import groupby

arr = [(1, "A"), (1, "B"), (1, "C"), (2, "D"), (2, "E"), (3, "F")]

for k,g in groupby(arr, lambda x: x[0]):
    print("--", k, "--")
    for tup in g:
        print(tup[1])  # tup[0] == k

Output:

-- 1 --
A
B
C
-- 2 --
D
E
-- 3 --
F
Tiago
4

One useful example that I came across may be helpful:

from itertools import groupby

# user input
myinput = input()

# list to store the output
myoutput = []

for k, g in groupby(myinput):
    myoutput.append((len(list(g)), int(k)))

print(*myoutput)

Sample input: 14445221

Sample output: (1, 1) (3, 4) (1, 5) (2, 2) (1, 1)

Arko
4

from random import randint
from itertools import groupby

l = [randint(1, 3) for _ in range(20)]

d = {}
for k, g in groupby(l, lambda x: x):
    if not d.get(k, None):
        d[k] = list(g)
    else:
        d[k] = d[k] + list(g)

The code above shows how groupby can be used to group a list based on the key function supplied (here a lambda). The only problem is that groupby does not merge non-adjacent groups that share a key; this is easily resolved by accumulating the groups in a dictionary.

Example:

l = [2, 1, 2, 3, 1, 3, 2, 1, 3, 3, 1, 3, 2, 3, 1, 2, 1, 3, 2, 3]

after applying groupby the result will be:

for k, g in groupby(l, lambda x:x):
    print(k, list(g))

2 [2]
1 [1]
2 [2]
3 [3]
1 [1]
3 [3]
2 [2]
1 [1]
3 [3, 3]
1 [1]
3 [3]
2 [2]
3 [3]
1 [1]
2 [2]
1 [1]
3 [3]
2 [2]
3 [3]

Once a dictionary is used as shown above, the following result is obtained, which can easily be iterated over:

{2: [2, 2, 2, 2, 2, 2], 1: [1, 1, 1, 1, 1, 1], 3: [3, 3, 3, 3, 3, 3, 3, 3]}
Ankit Gupta
  • 5
    Please provide an explanation on how this code answers the question (which was literally asking _how_ to use `groupby`). Also, the code has an indentation error. – Gino Mempin Oct 31 '21 at 04:42
1

The key thing to recognize with itertools.groupby is that items are only grouped together as long as they're sequential in the iterable. This is why sorting works: it rearranges the collection so that all items for which callback(item) returns the same key appear next to each other.

That being said, you don't need to sort the list. You just need a collection of key-value pairs in which each value can grow as groupby yields further groups for the same key, i.e. a dict of lists.

>>> import itertools
>>> things = [("vehicle", "bear"), ("animal", "duck"), ("animal", "cactus"), ("vehicle", "speed boat"), ("vehicle", "school bus")]
>>> coll = {}
>>> for k, g in itertools.groupby(things, lambda x: x[0]):
...     coll.setdefault(k, []).extend(i for _, i in g)
...
>>> coll
{'vehicle': ['bear', 'speed boat', 'school bus'], 'animal': ['duck', 'cactus']}

Michael Green