
I'm having an issue using the `heapq.merge` function with an iterator class. Once the merge function is down to one iterator, it finishes with the line `yield from next.__self__`, which, in the case of an iterator class, restarts the iterator. The behaviour can be observed with this code:

import heapq

class SimpleIterator:

    a_list = [0,1,2,3,4,5]

    def __iter__(self):
        self.list_iter = iter(self.a_list)
        return self

    def __next__(self):
        return next(self.list_iter)**2

print(list(SimpleIterator()))
print(list(heapq.merge(SimpleIterator(), SimpleIterator())))

Expected output:
[0, 1, 4, 9, 16, 25]
[0, 0, 1, 1, 4, 4, 9, 9, 16, 16, 25, 25]

Actual output:
[0, 1, 4, 9, 16, 25]
[0, 0, 1, 1, 4, 4, 9, 9, 16, 16, 25, 25, 0, 1, 4, 9, 16, 25]

Am I setting up my iterator class incorrectly?
Is there a problem with the merge function?
Should the merge function not be used with an Iterator class?

I have a workaround of using a generator function to wrap the class, but I was interested in whether there were other solutions.

Edit: the workaround:

def generator_wrapper(obj: SimpleIterator):
    yield from obj

print(list(generator_wrapper(SimpleIterator())))
print(list(heapq.merge(generator_wrapper(SimpleIterator()), generator_wrapper(SimpleIterator())))) 

1 Answer


TL;DR: Iterators are not allowed to start returning values again after they have raised `StopIteration`. `SimpleIterator` does, because its `__iter__` resets the underlying list iterator even after that iterator has been exhausted.


`heapq.merge`, once it detects that all but one of the iterables have been exhausted, uses `yield from next.__self__` to iterate over the remaining iterable. This triggers a second call to that iterable's `__iter__` method, in which you unconditionally create a new iterator over the list.
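
You can reproduce the restart without `heapq.merge` at all, because `yield from` starts by calling `iter()` on its operand, which is exactly what happens to the last remaining iterator in the merge. A minimal sketch, assuming the `SimpleIterator` class from the question (the variable names are just illustrative):

s = SimpleIterator()
it = iter(s)           # __iter__ creates the underlying list iterator
print(list(it))        # [0, 1, 4, 9, 16, 25] -- the iterator is now exhausted

# "yield from s" starts by calling iter(s) again, which unconditionally
# rebuilds self.list_iter, so the values come back instead of stopping:
print(list(iter(s)))   # [0, 1, 4, 9, 16, 25] again, rather than []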

One solution would be to only create `self.list_iter` if it does not already exist.

class SimpleIterator:

    a_list = [0,1,2,3,4,5]
    def __init__(self):
        self.list_iter = None

    def __iter__(self):
        if self.list_iter is None:
            self.list_iter = iter(self.a_list)
        return self

    def __next__(self):
        return next(self.list_iter)**2

This ensures the existing list iterator is used to finish the merge.
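
A quick check with the question's test case (expected result shown as a comment; this assumes the modified class above):

import heapq

print(list(heapq.merge(SimpleIterator(), SimpleIterator())))
# [0, 0, 1, 1, 4, 4, 9, 9, 16, 16, 25, 25] -- no spurious restart at the end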

From the documentation for `__next__`:

Once an iterator’s __next__() method raises StopIteration, it must continue to do so on subsequent calls. Implementations that do not obey this property are deemed broken.

By resetting `self.list_iter` after its `__next__` has raised `StopIteration` the first time, your implementation of `__iter__` does not satisfy this requirement.
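
For contrast, here is a small sketch of how a well-behaved iterator (a plain list iterator) handles exhaustion: once it has raised StopIteration, it stays empty.

it = iter([0, 1])
print(list(it))   # [0, 1]
print(list(it))   # [] -- an exhausted iterator stays exhausted
# next(it) would keep raising StopIteration, as the protocol requires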

This is only a problem because `__iter__` does not return a new iterator for an instance of `SimpleIterator`, but returns the existing object itself. The problem also goes away if you separate the iterator from the iterable. For example,

class SimpleIterable:
    a_list = [0,1,2,3,4,5]
    
    def __iter__(self):
        return SimpleIterator(self)


class SimpleIterator:
    def __init__(self, iterable):
        self.list_iterator = iter(iterable.a_list)

    # All iterators should return self
    def __iter__(self):
        return self

    def __next__(self):
        # keep the squaring from the question's example
        return next(self.list_iterator)**2
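
With this split, iterating the `SimpleIterable` always hands out a fresh `SimpleIterator`, and `SimpleIterator.__iter__` simply returns itself, so `heapq.merge`'s final `yield from` continues from wherever that iterator already is instead of rewinding it. A short sketch of the expected behaviour (keeping the squaring, as above):

import heapq

print(list(SimpleIterable()))
# [0, 1, 4, 9, 16, 25]
print(list(heapq.merge(SimpleIterable(), SimpleIterable())))
# [0, 0, 1, 1, 4, 4, 9, 9, 16, 16, 25, 25]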

Comments:

  • This feels like imprisoning Al Capone for tax evasion. I.e., you convict it for breaking some law, just so you can convict it, even if that's not the real issue. If we modify it so that it still resets but only if it hasn't reached StopIteration yet, then it wouldn't break that rule but it would still not be an alright iterator. – Kelly Bundy Mar 27 '23 at 13:34
  • Why would that not be an alright iterator? It's not the correct implementation *here*, but there is nothing fundamentally wrong with an iterator "finding" more items to return between calls to `__next__`. (That's basically how `tail -f` works: it treats an EOF condition as a signal to wait before trying to read again, rather than a signal to close the file and exit.) – chepner Mar 27 '23 at 13:55
  • Finding more after the current point is different than resetting, going back to the start. In the finding-more case, you can't even tell that that happened. In this particular case, heapq.merge, I think that that modified version would "work", but I'm sure there are other things where it would fail, and where an iterator that merely finds more ahead won't. I'll try to think of some during lunch. – Kelly Bundy Mar 27 '23 at 14:05
  • Already found one: It breaks the `consume` itertools recipe. If you use that recipe to consume a few elements of the iterator and then try to iterate the **remaining** ones, you instead get **all**. – Kelly Bundy Mar 27 '23 at 14:12
  • 1
    [Demo](https://tio.run/##dZJBS8NAEIXv@RVDvSRlG2qrIEIP4qkXL@JJJCzJpFnY7MadibT@@bi7aWIrWAINzHvfvJlJd@LGmu1D54ahdrYFxejYWk2g2s46BkValZgkFdZQWkN9i2kQSbZOgNm9WIPZYwL@t3iqvqQpEbhBmDRgVsTYEcgGZZXDvgbjoRB8YiICGlYO9SlfRNINvBFC3ZuSlVd4oORZO5E9kuEZqEOs8mhTF@wx0girvSKGGtv8ZlOGLUj4RmdXGs2BG6jws8fZW1qtcQyRx8rF7K08es9unUU1arrqKadd2LFz2/EJ4i6BWDpW5hDyd5ZUwIOZvQaPnI5rv9y0fzIRR8uSpNSSCF79jTTuz5rQXRZaEcMO3tfiVmzEVtyJ@w9fCOcrioAripRQ1@ebQXjPgykWvTP8RUU@wsbxwCH3zkT5jAtJ/@DOsjjDNTpbLjdJQr7Ddew0SzrnD5EuKfunPn13JGB7oR6GHw) – Kelly Bundy Mar 27 '23 at 14:12