
Assuming I have a generator yielding hashable values (str / int, etc.), is there a way to prevent the generator from yielding the same value twice?

Obviously, I'm using a generator so I don't need to unpack all the values first, so something like yield from set(some_generator) is not an option, since that would consume the entire generator.

Example:

# Current result
for x in my_generator():
    print(x)

>>> 1
>>> 17
>>> 15
>>> 1   # <-- This shouldn't be here
>>> 15  # <-- This neither!
>>> 3
>>> ...

# Wanted result
for x in my_no_duplicate_generator():
    print(x)

>>> 1
>>> 17
>>> 15
>>> 3
>>> ...

What's the most Pythonic solution for this?

bluesummers
    You can avoid unpacking everything into a set, but you will need to remember everything yielded, which will be a set the size of everything yielded. – Mark Jul 28 '19 at 08:25
  • That kind of misses the point, right now the only solution i can think of is to remember only the hash of each value... Hashes will use less memory than the entire values – bluesummers Jul 28 '19 at 08:26
  • Hashes can collide, which may or may not be a problem for you. – cha0site Jul 28 '19 at 08:29
    Or you could use a sliding window and only remember the [hash of the] last *n* values, if that helps at all. – deceze Jul 28 '19 at 08:29
  • Well, you cannot predict the future to know whether a value will be yielded again or not, so you need some kind of lookup table for values that have already been yielded, and then yield the next one – Devesh Kumar Singh Jul 28 '19 at 08:30
    Hashes can collide - if that's not a problem, you could use something like Bloom filter https://en.wikipedia.org/wiki/Bloom_filter – Andrej Kesely Jul 28 '19 at 08:31
  • About the "hashes can collide" thing, I know this is true, but wouldn't that be an issue also if using python's `set()` to unpack them all ? – bluesummers Jul 28 '19 at 08:36
    python's set and dict have ways to resolve hash collisions, you can read the source if you're interested https://github.com/python/cpython/blob/master/Objects/setobject.c – Chris_Rands Jul 28 '19 at 08:48
  • there's a question for dict explaining this too https://stackoverflow.com/questions/327311/how-are-pythons-built-in-dictionaries-implemented – Chris_Rands Jul 28 '19 at 08:55
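The sliding-window idea from the comments can be sketched as follows. This is my own illustration, not code from the thread: keep only the last *n* unique values in a deque, mirrored by a set for O(1) membership tests, so memory stays bounded at the cost of re-yielding duplicates that are more than *n* items apart.

```python
from collections import deque

def unique_in_window(iterable, n):
    """Yield elements, skipping any duplicate of the last n unique values.

    Duplicates farther apart than n unique items WILL be yielded again;
    that is the trade-off for bounded memory.
    """
    window = deque(maxlen=n)  # the last n unique values, in order
    seen = set()              # mirrors the window for fast lookups
    for x in iterable:
        if x not in seen:
            if len(window) == n:
                # The deque is about to evict its oldest element;
                # drop it from the set too so both stay in sync.
                seen.discard(window[0])
            window.append(x)
            seen.add(x)
            yield x
```

With `n=2`, the input `[1, 2, 1, 3, 1]` yields `[1, 2, 3, 1]`: the second `1` is suppressed, but the third one reappears because `1` had already slid out of the window.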

2 Answers


You can try this:

def my_no_duplicate_generator(iterable):
    seen = set()
    for x in iterable:
        if x not in seen:
            yield x
            seen.add(x)

You can use it by passing your generator as an argument:

for x in my_no_duplicate_generator(my_generator()):
    print(x)
Nikos Oikou
  • +1 but 1. the function should accept an iterator (`my_generator()`) instead of a function (`my_generator`) because that's more flexible. 2. this won't work with unhashable types (see my answer). – jferard Jul 28 '19 at 09:53
  • Thank you @jferard, very insightful. I updated my answer – Nikos Oikou Jul 28 '19 at 21:42

There is a unique_everseen recipe in the Python itertools module documentation that is roughly equivalent to @NikosOikou's answer.

The main drawback of these solutions is that they rely on the assumption that the elements of the iterable are hashable:

>>> L = [[1], [2,3], [1]]
>>> seen = set()
>>> for e in L: seen.add(e)
... 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

The more-itertools module refines the implementation to accept unhashable elements, and its documentation gives a tip on how to keep good speed in some cases (disclaimer: I'm the "author" of the tip).

You can check the source code.
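A sketch of how such a fallback can work (my own illustration, not the actual more-itertools source): try the set first, and fall back to a linear scan of a list when the element's key is unhashable. Passing a key such as tuple turns unhashable lists into hashable keys and restores set-speed lookups, which is the spirit of the tip mentioned above.

```python
def unique_everseen(iterable, key=None):
    """Yield unique elements in order, tolerating unhashable elements.

    Hashable keys go into a set (fast); unhashable keys fall back to
    a list with O(n) membership tests (slow but correct).
    """
    seen_set = set()
    seen_list = []
    for element in iterable:
        k = element if key is None else key(element)
        try:
            if k not in seen_set:
                seen_set.add(k)
                yield element
        except TypeError:
            # k is unhashable: fall back to the (slower) list.
            if k not in seen_list:
                seen_list.append(k)
                yield element
```

For example, `list(unique_everseen([[1], [2, 3], [1]]))` returns `[[1], [2, 3]]` via the list fallback, while `key=tuple` handles the same input entirely through the set.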

jferard