
I've encountered some unexpected empty lists when using zip to transpose the results of itertools.groupby. In reality my data is a bunch of objects, but for simplicity let's say my starting data is this list:

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]

I want to group the duplicates, so I use itertools.groupby (sorting first, because otherwise groupby only groups consecutive duplicates):

from itertools import groupby
duplicates = groupby(sorted(a))

This gives an itertools.groupby object which, when converted to a list, gives

[(1, <itertools._grouper object at 0x7fb3fdd86850>), (2, <itertools._grouper object at 0x7fb3fdd91700>), (3, <itertools._grouper object at 0x7fb3fdce7430>)]
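
For reference, on the unsorted list groupby only merges consecutive runs, which is why I sort first:

>>> [k for k, _ in groupby(a)]
[1, 2, 1, 3, 2, 1]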

So far, so good. But now I want to transpose the results so I have a list of the unique values, [1, 2, 3], and a list of the items in each duplicate group, [<itertools._grouper object ...>, ...]. For this I used the solution in this answer on using zip to "unzip":

>>> keys, values = zip(*duplicates)
>>> print(keys)
(1, 2, 3)
>>> print(values)
(<itertools._grouper object at 0x7fb3fdd37940>, <itertools._grouper object at 0x7fb3fddfb040>, <itertools._grouper object at 0x7fb3fddfb250>)

But when I try to read the itertools._grouper objects, I get a bunch of empty lists:

>>> for value in values:
...    print(list(value))
...
[]
[]
[]

What's going on? Shouldn't each value contain the duplicates in the original list, i.e. (1, 1, 1, 1, 1), (2, 2) and (3, 3)?

Sean
  • I believe `zip(*duplicates)` advances the groupby() object (see https://docs.python.org/3/library/itertools.html#itertools.groupby) and the contents are discarded. It's been a while since I've been coding Python, but try using izip instead of zip if you're using Python 2.x. I'm not sure though. – Steinar Lima Feb 22 '21 at 17:26
  • Yes, this is almost certainly due to unexpectedly consuming the `groupby` object, which is single-pass. – BallpointBen Feb 27 '21 at 15:40

2 Answers


To group the duplicates under each unique key and process them:

import itertools

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]
g1 = itertools.groupby(sorted(a))
for k, v in g1:
    print(f"Key {k} has", end=" ")
    for e in v:  # consume the group before groupby advances
        print(e, end=" ")
    print()
# Key 1 has 1 1 1 1 1
# Key 2 has 2 2
# Key 3 has 3 3
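
If you want to keep the groups around rather than print them, the same idea fits in a single dict comprehension (a quick sketch on the same data; each group is materialized with list() before groupby advances past it):

grouped = {k: list(v) for k, v in itertools.groupby(sorted(a))}
# {1: [1, 1, 1, 1, 1], 2: [2, 2], 3: [3, 3]}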

If you only need to count how many times each element occurs, you can skip the sorting entirely; groupby then yields one group per consecutive run, and a Counter merges the runs:

import itertools
import collections

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]
g1 = itertools.groupby(a)  # no sort: each group is a consecutive run
c1 = collections.Counter()
for k, v in g1:
    c1[k] += len(tuple(v))  # consume the run and add its length
for k, v in c1.items():
    print(f"Element {k} repeated {v} times")
# Element 1 repeated 5 times
# Element 2 repeated 2 times
# Element 3 repeated 2 times
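
Note that if only the totals matter, a plain Counter gives the same counts without groupby at all (a minimal alternative sketch, reusing a from above):

print(collections.Counter(a))
# Counter({1: 5, 2: 2, 3: 2})
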
VirtualScooter
  • Thanks for your answer, but it's not quite what I am after. In my question I state "I want to group the duplicates", by which I mean I want, for each nonidentical value, a list of the values considered identical by the key function passed to `groupby` in the input sequence. In other words, I am not after the counts, but the real objects. I used numbers in my question for simplicity, but in reality, as I state, the input is objects and I have to use a key function in `groupby` to separate them into each group. – Sean Feb 27 '21 at 14:55
  • OK. Let me change my answer to reflect the list of duplicates. – VirtualScooter Feb 27 '21 at 15:25

Ah. The beauty of multiple iterators all using the same underlying object.

The documentation of groupby addresses this very issue:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list:

groups = []
uniquekeys = []
data = sorted(data, key=keyfunc)
for k, g in groupby(data, keyfunc):
    groups.append(list(g))      # Store group iterator as a list
    uniquekeys.append(k)

So what ends up happening is that all your itertools._grouper objects are consumed before you ever unpack them. You see a similar effect if you try reusing any other iterator more than once. If you want to understand better, look at the next paragraph in the docs, which shows how the internals of groupby actually work.

Part of what helped me understand this was working through examples with a more obviously non-reusable iterator, like a file object. It helps you let go of the idea that there is an underlying buffer you can just keep track of.
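
The same exhaustion is easy to reproduce with any plain iterator, which may make the behavior less surprising (a minimal sketch):

it = iter([1, 2, 3])
print(list(it))  # [1, 2, 3] -- the first pass consumes the iterator
print(list(it))  # [] -- nothing is left for a second pass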

A simple fix is to consume the group objects yourself, as the documentation recommends:

# duplicates is a one-shot iterator over the sorted data:
duplicates = groupby(sorted(a))

# Converting duplicates to a list (or unpacking it with zip) consumes it,
# so materialize each group with list() as you go instead of storing
# the _grouper objects:
keys, values = zip(*((key, list(value)) for key, value in duplicates))
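
With the sample list from the question, this should give the transposed result that was wanted originally:

>>> keys
(1, 2, 3)
>>> values
([1, 1, 1, 1, 1], [2, 2], [3, 3])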

As the other answer suggests, you don't need an O(N log N) solution that involves sorting, since you can do this in O(N) time in a single pass. Rather than use a Counter, though, I'd recommend a defaultdict to help store the lists:

from collections import defaultdict

a = [1, 1, 1, 2, 1, 3, 3, 2, 1]  # the sample data from the question

result = defaultdict(list)
for item in a:
    result[item].append(item)  # here the key is the item itself

For more complex objects, you'd index with key(item) instead of item.
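
To recover the transposed shape the question asks for, you can split the dict afterwards (a sketch; dicts preserve insertion order in Python 3.7+, so keys appear in first-seen order):

keys = list(result)
values = list(result.values())
# keys == [1, 2, 3]
# values == [[1, 1, 1, 1, 1], [2, 2], [3, 3]]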

Mad Physicist
  • Thanks, that worked! I did actually end up using the `defaultdict` with `key(item)` approach instead of `groupby` when I couldn't get it to work. Should read the docs more carefully next time... – Sean Feb 27 '21 at 22:58
  • @Sean. Gets everyone every time. Creates a nice teachable moment every time too. – Mad Physicist Feb 27 '21 at 23:03