Detecting if an iterator will be consumed

Question

Is there an uniform way of knowing if an iterable object will be consumed by the iteration?

Suppose you have a certain function crunch which asks for an iterable object for parameter, and uses it many times. Something like:

def crunch (vals):

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

(note: merging together the two for loops is not an option).

An issue arises if the function gets called with an iterable which is not a list. In the following call the yum function is never executed:

crunch(iter(range(4))

We could in principle fix this by redefining the crunch function as follows:

def crunch (vals):
    vals = list(vals)

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

But this would result in using twice the memory if the call to crunch is:

hugeList = list(longDataStream)
crunch(hugeList)

We could fix this by defining crunch like this:

def crunch (vals):
    if type(vals) is not list:
        vals = list(vals)

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

But still there colud be the case in which the calling code stores data in something which

cannot be consumed
is not a list

For instance:

from collections import deque
hugeDeque = deque(longDataStream)
crunch(hugeDeque)

It would be nice to have a isconsumable predicate, so that we can define crunch like this:

def crunch (vals):
    if isconsumable(vals):
        vals = list(vals)

    for v in vals:
        chomp(v)

    for v in vals:
        yum(v)

Is there a solution for this problem?

`if isinstance(vals, collections.Iterator)` is similar to what you want — Volatility, Mar 13 '13 at 08:19
@Volatility: `Iterator` is not the appropriate choice, since an iterator still might be consumed. — BrenBarn, Mar 13 '13 at 08:22
What do you want to happen if `vals` is an infinite iterator? — unutbu, Mar 13 '13 at 08:23
@limelights: the point is exactly that I'm not doing assumptions on the parameters — Dacav, Mar 13 '13 at 08:24
@unutbu: fair question. Indeed in this case the second `for` would never be reached. But this would be an issue in any case. — Dacav, Mar 13 '13 at 08:25
@Volatility: almost there! But the following fails (python2.x): `isinstance(xrange(3), collections.Iterator)` — Dacav, Mar 13 '13 at 08:26
http://stackoverflow.com/questions/9884132/understanding-pythons-iterator-iterable-and-iteration-protocols-what-exact may clarify something on the terms used (iterable, iterator...) — Davide R., Mar 13 '13 at 08:57
I don't think you can guarantee it. Because you can create an iterable object which behaves one way, or the other, based on the result of `random.random()` or whatever other obfuscation you like. What should an object like this tell you when you ask, because it doesn't even know itself ahead of time whether it's consumable or not. — wim, Mar 13 '13 at 09:07

BrenBarn · Answer 1 · 2013-03-13T08:39:02.643

6

One possibility is to test whether the item is a Sequence, using isinstance(val, collections.Sequence). Non-consumability still isn't totally guaranteed but I think it's about the best you can get. A Python sequence has to have a length, which means that at least it can't be an open-ended iterator, and in general implies that the elements have to be known ahead of time, which in turn implies that they can be iterated over without consuming them. It's still possible to write pathological classes that fit the sequence protocol but aren't re-iterable, but you'll never be able to handle those.

Note that neither Iterable nor Iterator is the appropriate choice, because these types don't guarantee a length, and hence can't guarantee that the iteration will even be finite, let alone repeatable. You could, however, check for both Sized and Iterable.

The important thing is to document that your function will iterate over its argument twice, thus warning users that they must pass in an object that supports this.

edited Mar 13 '13 at 08:39

answered Mar 13 '13 at 08:21

BrenBarn

242,874
37
412
384

`Iterable` can be the choice if you don't care about size (that would be a problem of the function doc, not a runtime check, in the OP question). In any case you can check for a `Sized`, `Iterable` instance. `__getitem__` is not necessarily a guarantee. – Davide R. Mar 13 '13 at 08:37
1

@DavideR: You're right that `Sized`+`Iterable` works. `Iterable` alone, though, isn't sufficient, since it's easy to write an iterable that isn't re-iterable. – BrenBarn Mar 13 '13 at 08:40
Yes, it's easy to write one. It's even easy to write a Sequence which isn't re-iterable. But Iterable protocol requires the object to return an iterator whenever you try to iterate over it. That's the contract the OP is looking for. Any other implementation violates the Iterable protocol. – Davide R. Mar 13 '13 at 08:47
I was wrong about the Iterable point. I posted an answer to clarify and explain my error. Sorry. – Davide R. Mar 13 '13 at 09:58
+1, although I'd use `collections.Container` (and possibly also `collections.MappingView`) instead of `collections.Sequence`. – ecatmur Mar 13 '13 at 10:52

glglgl · Accepted Answer · 2013-03-13T10:02:41.613

5

Another, additional option could be to query if the iterable is its own iterator:

if iter(vals) is vals:
    vals = list(vals)

because in this case, it is just an iterator.

This works with generators, iterators, files and many other objects which are designed for "one run", in other words, all iterables which are iterators by itself, because an iterator returns self from its __iter__().

But this might not be enough, because there are objects which empty themselves on iteration without being their own iterator.

Normally, a self-consuming object will be its own iterator, but there are cases where this might not be allowed.

Imagine a class which wraps a list and empties this list on iteration, such as

class ListPart(object):
    """Liste stückweise zerlegen."""
    def __init__(self, data=None):
        if data is None: data = []
        self.data = data
    def next(self):
        try:
            return self.data.pop(0)
        except IndexError:
            raise StopIteration
    def __iter__(self):
        return self
    def __len__(self): # doesn't work with __getattr__...
        return len(self.data)

which you call like

l = [1, 2, 3, 4]
lp = ListPart(l)
for i in lp: process(i)
# now l is empty.

If I add now additional data to that list and iterate over the same object again, I'll get the new data which is a breach of the protocol:

The intention of the protocol is that once an iterator’s next() method raises StopIteration, it will continue to do so on subsequent calls. Implementations that do not obey this property are deemed broken. (This constraint was added in Python 2.3; in Python 2.2, various iterators are broken according to this rule.)

So in this case, the object would have to return an iterator distinct from itself despite of being self-consuming. In this case, this could be done with

def __iter__(self):
    while True:
        try:
            yield l.pop(0)
        except IndexError: # pop from empty list
            return

which returns a new generator on each iteration - something which would fall though the mash in the case we are discussing.

edited Mar 13 '13 at 10:02

answered Mar 13 '13 at 08:27

glglgl

89,107
13
149
217

1

+1 beat me to it. But it's first time I hear about "objects which empty themselves on iteration", can you elaborate? – Kos Mar 13 '13 at 08:31
+1, this seems to be effective. Is the "objects which empty themselves" a pathological case? – Dacav Mar 13 '13 at 08:36
"objects which empty themselves" means the behaviour described - "one-shot iteration" on which the data is handed out and lost. That is not pathological as long as the object is its own iterator. But if you have an object which hands out its data once, but returns a different object on its `__iter__()` call, you are lost. – glglgl Mar 13 '13 at 08:38
1

@gnibbler xrange is not "self-consuming", so you don't need the wrapping, do you? As `iter(vals) is not vals`, you don't create a list, you keep your `xrange` object. But as you can iterate over it twice, that doesn't hurt. – glglgl Mar 13 '13 at 09:08
«Imagine a class which wraps a list and empties this list on iteration» is a stretch and a violation to the Iterable protocol. This class shouldn't implement `__iter__` and must be regarded as an `Iterator`, *not* an `Iterable`. – Davide R. Mar 13 '13 at 09:26
@DavideR. Could you point me to this protocol detail? Isn't the object allowed to be changing? – glglgl Mar 13 '13 at 09:41
@DavideR. Because I think the opposite is true: as I write above, I may not use `next()` here, but have to return a new iterator from `__iter__()` each time. Being an iterator would be a breach, being an iterable returning a separate iterator would be ok from the protocol (if there is not any point which I didn't get till now), but, as said, would not work in our discussed case. – glglgl Mar 13 '13 at 09:49
@DavideR. And BTW: How can an `Iterator` be not implementing `__iter__()`? – glglgl Mar 13 '13 at 09:49
I apologize, it was my confusion. I posted an answer explaining my point and where I was wrong. Yes, your check will indeed work for most cases. Not all cases, as I also explain in my answer. Sorry for my mistake. – Davide R. Mar 13 '13 at 09:56

score 4 · Answer 3 · answered Mar 13 '13 at 08:21

4

def crunch (vals):
    vals1, vals2 = itertools.tee(vals, 2)

    for v in vals1:
        chomp(v)

    for v in vals2:
        yum(v)

In this case tee will end up storing the entirity of vals internally since one iterator is completed before the other one is started

answered Mar 13 '13 at 08:21

John La Rooy

295,403
53
369
502

... and this storing could lead to the same memory "problems" the OP mentions. Nevertheless +1, because there is no easy way around it. – glglgl Mar 13 '13 at 08:25
1

As indicated in [the documentation](http://docs.python.org/2/library/itertools.html#itertools.tee), this is worse than using `list` in a case like this, where the entire iterable will be consumed once before the second iteration is started. – BrenBarn Mar 13 '13 at 08:25
OP here: had the same idea but indeed it would be ineffective. +1 anyways. – Dacav Mar 13 '13 at 08:28
@wim this would fail in two ways in the infinite case: it would never end the first cycle, and `tee` would run out of memory. – Davide R. Mar 13 '13 at 09:27
well, this would, but a similar situation where only a finite amount of an infinite iterator was consumed is a similar/related problem where you would use tee rather than create a list from the iterator. get my drift? – wim Mar 13 '13 at 09:53
Yes, that's what `tee()` is for, indeed. :) – Davide R. Mar 13 '13 at 09:59
@BrenBarn The OP *could*(in some situations) rewrite the cycle to do some `chomp` and some `yum`s alternatively, thus being better than `list`, although the code would become more complicated. – Bakuriu Mar 13 '13 at 16:03

Davide R. · Answer 4 · 2013-03-13T10:17:49.447

Many answers come close to the point but miss it.

An Iterator is an object that is consumed by iterating over it. There is no way around it. Example of iterator objects are those returned by calls to iter(), or those returned by the functions in the itertools module.

The proper way to check whether an object is an iterator is to call isinstance(obj, Iterator). This basically checks whether the object implements the next() method (__next__() in Python 3) but you don't need to care about this.

So, remember, an iterator is always consumed. For example:

# suppose you have a list
my_list = [10, 20, 30]
# and build an iterator on the list
my_iterator = iter(my_list)
# iterate the first time over the object
for x in my_iterator:
    print x
# then again
for x in my_iterator:
    print x

This will print the content of the list just once.

Then there are Iterable objects. When you call iter() on an iterable it will return an iterator. Commenting in this page I made myself an error, so I will clarify here. Iterable objects are not required to return a new iterator on every call. Many iterators themselves are iterables (i.e. you can call iter() on them) and they will return the object itself.

A simple example for this are list iterators. iter(my_list) and iter(iter(my_list)) are the same object, and this is basically what @glglgl answer is checking for.

The iterator protocol requires iterator objects to return themselves as their own iterator (and thus be iterable). This is not required for the iteration mechanics to work, but you wouldn't be able to loop over the iterator object.

All of this said, what you should do is check whether you're given an Iterator, and if that's the case, make a copy of the result of the iteration (with list()). Your isconsumable(obj) is (as someone already said) isinstance(obj, Iterator).

Note that this also works for xrange(). xrange(10) returns an xrange object. Every time you iter over the xrange objects it returns a new iterator starting from the start, so you're fine and don't need to make a copy.

Normally, an iterator should be an iterable as well and thus return `self` from its `__iter__()` in order to iterate over it as well and not only its "parent". See http://docs.python.org/2/library/stdtypes.html#iterator.__iter__. Otherwise you couldn't do `it = iter(l); for i in it: ...`. Till now, I never came across Iterators which weren't iterable as well. — glglgl, Mar 13 '13 at 10:04
@glglgl so basically your answer works fine in all cases that implement the iterator protocol as described in the docs you linked. The identity check baffles me, but it will indeed work. Again, sorry for making a wrong point without checking the right docs. — Davide R., Mar 13 '13 at 10:20

score 1 · Answer 5 · answered Mar 13 '13 at 17:28

Here is a summary of definitions.

container

An object with a __contains__ method

generator

A function which returns an iterator.

iterable

A object with an __iter__() or __getitem__() method.
Examples of iterables include all sequence types (such as list, str, and tuple) and some non-sequence types like dict and file.
When an iterable object is passed as an argument to the builtin function iter(), it returns an iterator for the object. This iterator is good for one pass over the set of values.

iterator

An iterable which has a next() method.
Iterators are required to have an __iter__() method that returns the iterator object itself.
An iterator is good for one pass over the set of values.

sequence

An iterable which supports efficient element access using integer indices via the __getitem__() special method and defines a len() method that returns the length of the sequence.
Some built-in sequence types are list, str, tuple, and unicode.
Note that dict also supports __getitem__() and __len__(), but is considered a mapping rather than a sequence because the lookups use arbitrary immutable keys rather than integers.

Now there is a multitude of ways of testing if an object is an iterable, or iterator, or sequence of some sort. Here is a summary of these ways, and how they classify various kinds of objects:

               Iterable Iterator iter_is_self Sequence MutableSeq
object                                                           
[]                 True    False        False     True       True
()                 True    False        False     True      False
set([])            True    False        False    False      False
{}                 True    False        False    False      False
deque([])          True    False        False    False      False
<listiterator>     True     True         True    False      False
<generator>        True     True         True    False      False
string             True    False        False     True      False
unicode            True    False        False     True      False
<open>             True     True         True    False      False
xrange(1)          True    False        False     True      False
Foo.__iter__       True    False        False    False      False

                Sized has_len has_iter has_contains
object                                             
[]               True    True     True         True
()               True    True     True         True
set([])          True    True     True         True
{}               True    True     True         True
deque([])        True    True     True        False
<listiterator>  False   False     True        False
<generator>     False   False     True        False
string           True    True    False         True
unicode          True    True    False         True
<open>          False   False     True        False
xrange(1)        True    True     True        False
Foo.__iter__    False   False     True        False

Each columns refers to a different way to classify iterables, each rows refers to a different kind of object.

import pandas as pd
import collections
import os


def col_iterable(obj):
    return isinstance(obj, collections.Iterable)


def col_iterator(obj):
    return isinstance(obj, collections.Iterator)


def col_sequence(obj):
    return isinstance(obj, collections.Sequence)


def col_mutable_sequence(obj):
    return isinstance(obj, collections.MutableSequence)


def col_sized(obj):
    return isinstance(obj, collections.Sized)


def has_len(obj):
    return hasattr(obj, '__len__')


def listtype(obj):
    return isinstance(obj, types.ListType)


def tupletype(obj):
    return isinstance(obj, types.TupleType)


def has_iter(obj):
    "Could this be a way to distinguish basestrings from other iterables?"
    return hasattr(obj, '__iter__')


def has_contains(obj):
    return hasattr(obj, '__contains__')


def iter_is_self(obj):
    "Seems identical to col_iterator"
    return iter(obj) is obj


def gen():
    yield


def short_str(obj):
    text = str(obj)
    if text.startswith('<'):
        text = text.split()[0] + '>'
    return text


def isiterable():
    class Foo(object):
        def __init__(self):
            self.data = [1, 2, 3]

        def __iter__(self):
            while True:
                try:
                    yield self.data.pop(0)
                except IndexError:  # pop from empty list
                    return

        def __repr__(self):
            return "Foo.__iter__"
    filename = 'mytestfile'
    f = open(filename, 'w')
    objs = [list(), tuple(), set(), dict(),
            collections.deque(), iter([]), gen(), 'string', u'unicode',
            f, xrange(1), Foo()]
    tests = [
        (short_str, 'object'),
        (col_iterable, 'Iterable'),
        (col_iterator, 'Iterator'),
        (iter_is_self, 'iter_is_self'),
        (col_sequence, 'Sequence'),
        (col_mutable_sequence, 'MutableSeq'),
        (col_sized, 'Sized'),
        (has_len, 'has_len'),
        (has_iter, 'has_iter'),
        (has_contains, 'has_contains'),
    ]
    funcs, labels = zip(*tests)
    data = [[test(obj) for test in funcs] for obj in objs]
    f.close()
    os.unlink(filename)
    df = pd.DataFrame(data, columns=labels)
    df = df.set_index('object')
    print(df.ix[:, 'Iterable':'MutableSeq'])
    print
    print(df.ix[:, 'Sized':])

isiterable()