4

I'm trying to use itertools.tee to know if an iterator is empty without consuming it entirely:

from itertools import tee
def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
       # iterator is empty - raises some error
       pass
    return i2 # return not empty iterator to caller

As the docs of tee states:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().

so it's clearly when i is not empty, i2 uses most of the data before i1 does. Can a simple del overcome this?:

from itertools import tee
def get_iterator(i):
    i1, i2 = tee(i, 2)
    if next(i1, None) is None:
       # iterator is empty - raises some error
       pass
    del i1  # Does this overcome storage issue?
    return i2  # return not empty iterator to caller

Is there a better way of achieving this goal?

Thanks in advance!

Piotr Dobrogost
  • 41,292
  • 40
  • 236
  • 366
NI6
  • 2,477
  • 5
  • 17
  • 28
  • See [Testing for an empty iterator ActiveState's recipe](https://github.com/ActiveState/code/tree/master/recipes/Python/413614_Testing_for_an_empty_iterator) – Piotr Dobrogost May 30 '18 at 08:46
  • @Chris_Rands *tee basically does exhaust the entire iterator to create the new iterators* – this is totally not true. – Piotr Dobrogost May 30 '18 at 08:50
  • @Chris_Rands The docs say "The following Python code helps explain what tee does (although the actual implementation is more complex and uses only a single underlying FIFO queue)." If you look at `teedataobject_getitem` in the CPython code, you see that it only grabs new data `PyIter_Next` if the lead iterator has reached that point. It then stores that until all `tee`s have used the value. – FHTMitchell May 30 '18 at 09:04
  • See Alex Martelli's remark on sentinel value [here](https://stackoverflow.com/a/3114573/95735). – Piotr Dobrogost May 30 '18 at 09:07
  • @FHTMitchell Are you saying tee does exhaust the original iterator to create the independent ones or not? And how does this impact the memory usage? – Chris_Rands May 30 '18 at 09:19
  • 2
    @Chris_Rands Only when you run through one of the new iterators. For example, if you use `a, b, c = tee(itr, 3)` then if you do `i = next(a); del i` you will have `i` stored in memory until **both** `next(b)` and `next(c)` are executed. Worst case scenario, if you do `la = list(a)` then you will have `len(la)` elements in memory until both `b` and `c` are iterated forward. – FHTMitchell May 30 '18 at 09:23

2 Answers2

4

This is somewhat subtle - it depends on an undocumented property of the tee function along with an intentionally vague property of the garbage collector. The sample Python code would store all the items from the point where the iterators were created until they're consumed by each iterator, but one might easily imagine that the iterators would have a cleanup effect that would drop their claim on data in the queue. But even so, del removes your name; it doesn't guarantee the object's destruction. Such a cleanup would thus work but not necessarily at the time you expect it. Knowing whether this happens would require reading the source code for tee. It does have weak reference support for the individual iterators, suggesting one way this optimization could have been done.

The CPython code for tee_next is reasonably simple; it holds a reference to a teedataobject, which is a batch of up to 57 items, also forming a singly linked list. Thus the normal reference counting semantics apply at that batch level. So basically, for CPython, up to 56 items are kept in memory even after they've been consumed by all the iterators, but no more than that because the reference count handling is immediate. As long as the tee iterators exist, any number of items between them can be held, but they do not read ahead from the source iterator; at least one tee iterator must have fetched the items via teedataobject_getitem.

So the basic verdict is: yes, del will work in CPython, but using tee means you're temporarily storing batches of 57 items, not 1. Repeating this method could cause an arbitrary number of such windows - except tee iterators are copyable, and will share their underlying list.

This is the interpretation specifically of one version (4243df51fe43) of CPython. Implementations will differ in e.g. PyPy, IronPython, Jython, or other versions of CPython.

For instance, PyPy's tee (version cadf868) uses a similar linked list with one item per link, so doesn't batch up the way this CPython version did.

There is one notable shortcut that prevents this cost from growing: both tee implementations I examined produce copyable iterators, and also copy copyable iterators. So repeatedly applying tee doesn't create new layers of iterators, a potential problem with the chain approach.

Yann Vernier
  • 15,414
  • 2
  • 28
  • 26
  • 1
    *Implementations will differ in (…) or different version of CPython.* :) – Piotr Dobrogost May 30 '18 at 09:42
  • What do you mean by *copyable iterator*? – Piotr Dobrogost May 30 '18 at 22:45
  • A copyable iterator has a [`__copy__`](https://docs.python.org/3/library/copy.html) method to produce a copy of itself. A copy of an iterator produces the same output as the first iterator. `tee` produces copies of iterators even if they don't have the `__copy__` method, by storing the values as needed. – Yann Vernier May 31 '18 at 07:04
3

I mean, in your specific case, what wrong with this

from itertools import chain
def get_iterator(i):
    try:
        first = next(i):
    except StopIteration:
       # iterator is empty - raises some error
       pass
    return chain([first], i)

It does exactly the same thing, but doesn't store anything other than the first value.

FHTMitchell
  • 11,793
  • 2
  • 35
  • 47
  • It actually did solved my specific need, but I was also curious about the impact of del on tee objects. Many thanks! – NI6 May 30 '18 at 09:49
  • 1
    This answer deserves its upvotes. It applies to the second question "is there a better way", while mine applied to the first, "can a del overcome [the storage costs of tee]". – Yann Vernier May 30 '18 at 09:51
  • It's a bad taste to give answer already pointed in comments to the question… – Piotr Dobrogost May 30 '18 at 09:53
  • @PiotrDobrogost Honestly, I didn't even follow your link. I came up with this independently. Apologies for not checking first. – FHTMitchell May 30 '18 at 10:23
  • I actually found an argument against this form (and for the `tee` form). Both PyPy and CPython `tee` implementations copy iterators if they can, which doesn't copy the underlying storage - and `tee` iterators are copyable. – Yann Vernier May 30 '18 at 19:30