5

Given an iterator it, I would like a function it_count that returns the count of elements that iterator produces, without destroying the iterator. For example:

ita = iter([1, 2, 3])
print(it_count(ita))
print(it_count(ita))

should print

3
3

It has been pointed out that this may not be a well-defined question for all iterators, so I am not looking for a completely general solution, but it should function as anticipated on the example given.


Okay, let me clarify further to my specific case. Given the following code:

ita = iter([1, 2, 3])
itb, itc = itertools.tee(ita)
print(sum(1 for _ in itb))
print(sum(1 for _ in itc))

...can we write the it_count function described above, so that it will function in this manner? Even if the answer to the question is "That cannot be done," that's still a perfectly valid answer. It doesn't make the question bad. And the proof that it is impossible would be far from trivial...

Peter O.
  • 32,158
  • 14
  • 82
  • 96
Apollys supports Monica
  • 2,938
  • 1
  • 23
  • 33
  • What have you tried so far? What's tripping you up? – Christopher Apple Apr 13 '17 at 20:08
  • Instead of giving *abstract* examples (which obviously leads to people telling you that this can't be done for the *general* case), can you please describe what specific kind of iterator you're actually dealing with, and why you need to do this? – Lukas Graf Apr 13 '17 at 20:31
  • Please don't change the original intention of the question, please post a new question instead if you have "follow-up"-questions. – MSeifert Apr 13 '17 at 20:32
  • 2
    It's not a follow-up or modification of the original question. It's a clarification for people who didn't understand the intent. – Apollys supports Monica Apr 13 '17 at 20:34
  • The major point of an iterator is that it is executed lazily, you get elements from it as they are needed and doesn't necessarily have a well defined length (so the question isn't ill-defined, the functionality you are asking for is ill-defined) If you were really constructing your iterator from a list literal you'd just get the `len` of the list so what iterator are you actually wondering about? – Tadhg McDonald-Jensen Apr 13 '17 at 20:39
  • @Apollys If you read the documentation of `itertools.tee` then you'll notice the following sentence: "_This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()._". That I proposed it in my answer was meant as a workaround (but I mentioned that it's really inefficient!). – MSeifert Apr 13 '17 at 20:39
  • Possible duplicate of [Can iterators be reset in Python?](http://stackoverflow.com/questions/3266180/can-iterators-be-reset-in-python) – Matthias Apr 14 '17 at 13:24
  • Clearly none of the answers to that solve this. Rampant duplicate marking on this site, sigh... – Apollys supports Monica Apr 14 '17 at 16:15

4 Answers4

8

Not possible. Until the iterator has been completely consumed, it doesn't have a concrete element count.

user2357112
  • 260,549
  • 28
  • 431
  • 505
4

The only way to get the length of an arbitary iterator is by iterating over it, so the basic question here is ill-defined. You can't get the length of any iterator without iterating over it.

Also the iterator itself may change it's contents while being iterated over, so the count may not be constant anyway.


But there are possibilities that might do what you ask, be warned none of them is foolproof or really efficient:

When using python 3.4 or later you can use operator.length_hint and hope the iterator supports it (be warned: not many iterators do! And it's only meant as a hint, the actual length might be different!):

>>> from operator import length_hint

>>> it_count = length_hint

>>> ita = iter([1, 2, 3])
>>> print(it_count(ita))
3
>>> print(it_count(ita))
3

As alternative: You can use itertools.tee but read the documentation of that carefully before using it. It may solve your issue but it won't really solve the underlying problem.

import itertools

def it_count(iterator):
    return sum(1 for _ in iterator)

ita = iter([1, 2, 3])
it1, it2 = itertools.tee(ita, 2)
print(it_count(it1))  # 3
print(it_count(it2))  # 3

But this is less efficient (memory and speed) than casting it to a list and using len on it.

MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • "you need to know in advance how many "subiterators" you need" - it's poorly documented, but [you can actually generate all the tees you want](http://stackoverflow.com/questions/30232531/in-python-can-i-lazily-generate-copies-of-an-iterator-using-tee) as long as you kept a copy you didn't advance. – user2357112 Apr 13 '17 at 20:21
  • ah, yeah, totally forgot that `tee` supports `copy`. – MSeifert Apr 13 '17 at 20:21
  • I was looking into `itertools.tee`, however, the original iterator will still be consumed. – Apollys supports Monica Apr 13 '17 at 20:23
  • @Apollys Then you need [user2357112](http://stackoverflow.com/a/43400952/5393381)s answer: It's not possible. – MSeifert Apr 13 '17 at 20:25
1

There's no generic way to do what you want. An iterator may not have a well defined length (e.g. itertools.count which iterates forever). Or it might have a length that's expensive to calculate up front, so it won't let you know how far you have to go until you've reached the end (e.g. a file object, which can be iterated yielding lines, which are not easy to count without reading the whole file's contents).

Some kinds of iterators might implement a __length_hint__ method that returns an estimated length, but that length may not be accurate. And not all iterators will implement that method at all, so you probably can't rely upon it (it does work for list iterators, but not for many others).

Often the best way to deal with the whole contents of an iterator is to dump it into a list or other container. After you're done doing whatever operation you need (like calling len on it), you can iterate over the list again. Obviously this requires the iterator to be finite (and for all of its contents to fit into memory), but that's the limitation you have to deal with.

If you only need to peek ahead by a few elements, you might be able to use itertools.tee, but it's no better than dumping into a list if you need to consume the whole contents (since it keeps the values seen by one of its returned iterators but another in a data structure similar to a deque). It wouldn't be any use for finding the length of the iterator.

Blckknght
  • 100,903
  • 11
  • 120
  • 169
  • 2
    `tee` doesn't actually use a `deque` - it stores elements in a specialized [unrolled, singly-linked list](https://en.wikipedia.org/wiki/Unrolled_linked_list). See `teedataobject` and `teeobject` in [`Modules/itertoolsmodule.c`](https://github.com/python/cpython/blob/master/Modules/itertoolsmodule.c#L378). – user2357112 Apr 13 '17 at 20:47
  • You're right that it's not exactly a `deque`, but the linked list implementation for the `teeobject` isn't that different than [the implementation of `deque`](https://github.com/python/cpython/blob/master/Modules/_collectionsmodule.c#L71) which uses a doubly linked list of 64-value blocks. – Blckknght Apr 13 '17 at 21:12
1

I have not been able to come up with an exact solution (because iterators may be immutable types), but here are my best attempts. I believe the second should be faster, according to the documentation (final paragraph of itertools.tee).

Option 1

def it_count(it):
   tmp_it, new_it = itertools.tee(it)
   return sum(1 for _ in tmp_it), new_it

Option 2

def it_count2(it):
   lst = list(it)
   return len(lst), lst

It functions well, but has the slight annoyance of returning the pair rather than simply the count.

ita = iter([1, 2, 3])
count, ita = it_count(ita)
print(count)

Output: 3

count, ita = it_count2(ita)
print(count)

Output: 3

count, ita = it_count(ita)
print(count)

Output: 3

print(list(ita))

Output: [1, 2, 3]
Apollys supports Monica
  • 2,938
  • 1
  • 23
  • 33
  • This may work in some cases but just to give you two examples where it probably won't work: `ita = itertools.count()` or `l = [1, 2, 3]; ita = map(l.append, l)`? Note that these approaches technically "consume" the iterator. – MSeifert Apr 13 '17 at 21:07
  • Of course there's no answer if the generator continues "infinitely," that's why I clarified the question. – Apollys supports Monica Apr 13 '17 at 21:15
  • But based on your example you could just use `operator.length_hint`. What "subset" of iterators are we talking about? – MSeifert Apr 13 '17 at 21:19
  • My clarification does elucidate that... ones for which cloning the iterator using `tee` and then computing `sum(1 for _ in cloned_it)` works. – Apollys supports Monica Apr 13 '17 at 21:30