
Assuming I have a generator yielding hashable values (str / int, etc.), is there a way to prevent the generator from yielding the same value twice?

Obviously, I'm using a generator so I don't need to unpack all the values first, so something like yield from set(some_generator) is not an option, since that would consume the entire generator.

Example:

# Current result
for x in my_generator():
    print(x)

>>> 1
>>> 17
>>> 15
>>> 1   # <-- This shouldn't be here
>>> 15  # <-- This neither!
>>> 3
>>> ...

# Wanted result
for x in my_no_duplicate_generator():
    print(x)

>>> 1
>>> 17
>>> 15
>>> 3
>>> ...

What's the most Pythonic solution for this?

bluesummers
    You can avoid unpacking everything into a set, but you will need to remember everything yielded, which will be a set the size of everything yielded. – Mark Jul 28 '19 at 08:25
  • That kind of misses the point, right now the only solution i can think of is to remember only the hash of each value... Hashes will use less memory than the entire values – bluesummers Jul 28 '19 at 08:26
  • Hashes can collide, which may or may not be a problem for you. – cha0site Jul 28 '19 at 08:29
    Or you could use a sliding window and only remember the [hash of the] last *n* values, if that helps at all. – deceze Jul 28 '19 at 08:29
  • Well, you cannot predict the future to know whether a value will be yielded again or not, so you need some kind of lookup table for values that have already been yielded, and then yield the next one – Devesh Kumar Singh Jul 28 '19 at 08:30
    Hashes can collide - if that's not a problem, you could use something like Bloom filter https://en.wikipedia.org/wiki/Bloom_filter – Andrej Kesely Jul 28 '19 at 08:31
  • About the "hashes can collide" thing, I know this is true, but wouldn't that be an issue also if using python's `set()` to unpack them all ? – bluesummers Jul 28 '19 at 08:36
    python's set and dict have ways to resolve hash collisions, you can read the source if you're interested https://github.com/python/cpython/blob/master/Objects/setobject.c – Chris_Rands Jul 28 '19 at 08:48
  • there's a question for dict explaining this too https://stackoverflow.com/questions/327311/how-are-pythons-built-in-dictionaries-implemented – Chris_Rands Jul 28 '19 at 08:55
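The sliding-window idea from the comments can be sketched as follows. This is my own illustration, not code from the thread: keep only the last *n* unique values in a deque, mirrored by a set for O(1) membership tests, so memory stays bounded at the cost of re-yielding duplicates that are more than *n* items apart.

```python
from collections import deque

def unique_in_window(iterable, n):
    """Yield elements, skipping any duplicate of the last n unique values.

    Duplicates farther apart than n unique items WILL be yielded again;
    that is the trade-off for bounded memory.
    """
    window = deque(maxlen=n)  # the last n unique values, in order
    seen = set()              # mirrors the window for fast lookups
    for x in iterable:
        if x not in seen:
            if len(window) == n:
                # The deque is about to evict its oldest element;
                # drop it from the set too so both stay in sync.
                seen.discard(window[0])
            window.append(x)
            seen.add(x)
            yield x
```

With `n=2`, the input `[1, 2, 1, 3, 1]` yields `[1, 2, 3, 1]`: the second `1` is suppressed, but the third one reappears because `1` had already slid out of the window.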

2 Answers


You can try this:

def my_no_duplicate_generator(iterable):
    seen = set()
    for x in iterable:
        if x not in seen:
            yield x
            seen.add(x)

You can use it by passing your generator as an argument:

for x in my_no_duplicate_generator(my_generator()):
    print(x)
Nikos Oikou
  • +1 but 1. the function should accept an iterator (`my_generator()`) instead of a function (`my_generator`) because that's more flexible. 2. this won't work with unhashable types (see my answer). – jferard Jul 28 '19 at 09:53
  • Thank you @jferard, very insightful. I updated my answer – Nikos Oikou Jul 28 '19 at 21:42

There is a unique_everseen recipe in the Python itertools module documentation that is roughly equivalent to @NikosOikou's answer.

The main drawback of these solutions is that they rely on the assumption that the elements of the iterable are hashable:

>>> L = [[1], [2,3], [1]]
>>> seen = set()
>>> for e in L: seen.add(e)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

The more-itertools module refines the implementation to accept unhashable elements, and its documentation gives a tip on how to keep good speed in some cases (disclaimer: I'm the "author" of the tip).

You can check the source code.
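A sketch of how such a fallback can work (my own illustration, not the actual more-itertools source): try the set first, and fall back to a linear scan of a list when the element's key is unhashable. Passing a key such as tuple turns unhashable lists into hashable keys and restores set-speed lookups, which is the spirit of the tip mentioned above.

```python
def unique_everseen(iterable, key=None):
    """Yield unique elements in order, tolerating unhashable elements.

    Hashable keys go into a set (fast); unhashable keys fall back to
    a list with O(n) membership tests (slow but correct).
    """
    seen_set = set()
    seen_list = []
    for element in iterable:
        k = element if key is None else key(element)
        try:
            if k not in seen_set:
                seen_set.add(k)
                yield element
        except TypeError:
            # k is unhashable: fall back to the (slower) list.
            if k not in seen_list:
                seen_list.append(k)
                yield element
```

For example, `list(unique_everseen([[1], [2, 3], [1]]))` returns `[[1], [2, 3]]` via the list fallback, while `key=tuple` handles the same input entirely through the set.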

jferard