10

I have these two implementations to compute the length of a finite generator, while keeping the data for further processing:

def count_generator1(generator):
    '''- build a list with the generator data
       - get the length of the data
       - return both the length and the original data (in a list)
       WARNING: the memory use is unbounded, and infinite generators will block this'''
    l = list(generator)
    return len(l), l

import itertools

def count_generator2(generator):
    '''- get two generators from the original generator
       - get the length of the data from one of them
       - return both the length and the original data, as returned by tee
       WARNING: tee can use up an unbounded amount of memory, and infinite generators will block this'''
    for_length, saved = itertools.tee(generator, 2)
    return sum(1 for _ in for_length), saved

Both have drawbacks, both do the job. Could somebody comment on them, or even offer a better alternative?

blueFast
  • There is no way to know the length of an iterable generator without consuming the entire thing. – Jakob Bowyer Aug 02 '13 at 10:15
  • I know. That is not the question. – blueFast Aug 02 '13 at 10:16
  • Note: if you don't need the precise length, you could use [`operator.length_hint()` (Python 3.4+)](http://docs.python.org/dev/library/operator#operator.length_hint), which returns an estimated length without consuming the iterator (see the sketch after this comment thread). See also [PEP 424 - A method for exposing a length hint](http://www.python.org/dev/peps/pep-0424/). – jfs Aug 02 '13 at 10:54
  • @J.F.Sebastian That's a nice addition for 3.4. – Gareth Latty Aug 02 '13 at 10:59
  • Thanks, @J.F.Sebastian, but it is not clear whether `length_hint` will consume the data or not. I assume not; otherwise, what would be the point? Anyway, it is too magic for my taste. – blueFast Aug 02 '13 at 11:28
  • @gonvaled: `length_hint` will call `__length_hint__()`, which is tricky to implement on a generator. – Lennart Regebro Aug 05 '13 at 08:24
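
For reference, a quick doctest-style sketch (assuming Python 3.4+) of how `operator.length_hint()` behaves; it does not consume the iterator, but a plain generator provides no hint, so the default is returned:

>>> from operator import length_hint
>>> length_hint([1, 2, 3])                   # uses len() when available
3
>>> length_hint(iter(range(10)))             # range iterators implement __length_hint__
10
>>> length_hint((x for x in range(10)), -1)  # plain generators: the default is returned
-1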

3 Answers

13

If you have to do this, the first method is much better: since you consume all the values, itertools.tee() will have to store them all anyway, so a list will be more efficient.

To quote from the docs:

This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
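
To illustrate the point, here is a minimal sketch (names are made up) of the pattern from the question: because `for_length` is drained before `saved` is touched, `tee` must buffer every item internally, so peak memory matches building a list while adding bookkeeping overhead:

import itertools

gen = (x * x for x in range(100000))
for_length, saved = itertools.tee(gen)
length = sum(1 for _ in for_length)  # forces tee to buffer all 100000 items for 'saved'
data = list(saved)                   # replays the buffered items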

Gareth Latty
  • Well, in both cases I am consuming the generator and storing the full data: in the first by creating a `list`, in the second because `tee` has to do the same (or something similar). I think that getting the length of the list is faster (it is already part of the list object?), which is why I tend to prefer the first method. From the point of view of memory consumption, both seem equivalent, right? – blueFast Aug 02 '13 at 10:24
  • @gonvaled Memory usage will probably be similar, but as the docs quoted above say, making a list will be faster. – Gareth Latty Aug 02 '13 at 10:27
4

I ran timeit on a few approaches I could think of, using 64-bit Python 3.4.3 on Windows:

>>> from timeit import timeit
>>> from textwrap import dedent as d
>>> timeit(
...     d("""
...     count = -1
...     for _ in s:
...         count += 1
...     count += 1
...     """),
...     "s = range(1000)",
... )
50.70772041983173
>>> timeit(
...     d("""
...     count = -1
...     for count, _ in enumerate(s):
...         pass
...     count += 1
...     """),
...     "s = range(1000)",
... )
42.636973504498656
>>> timeit(
...     d("""
...     count, _ = reduce(f, enumerate(s), (-1, -1))
...     count += 1
...     """),
...     d("""
...     from functools import reduce
...     def f(_, count):
...         return count
...     s = range(1000)
...     """),
... )
121.15513102540672
>>> timeit("count = sum(1 for _ in s)", "s = range(1000)")
58.179126025925825
>>> timeit("count = len(tuple(s))", "s = range(1000)")
19.777029680237774
>>> timeit("count = len(list(s))", "s = range(1000)")
18.145157531932
>>> timeit("count = len(list(1 for _ in s))", "s = range(1000)")
57.41422175998332

Shockingly, the fastest approach was to use a list (not even a tuple) to exhaust the iterator and get the length from there:

>>> timeit("count = len(list(s))", "s = range(1000)")
18.145157531932

Of course, this risks memory issues. The best low-memory alternative was to use `enumerate` in a no-op `for` loop:

>>> timeit(
...     d("""
...     count = -1
...     for count, _ in enumerate(s):
...         pass
...     count += 1
...     """),
...     "s = range(1000)",
... )
42.636973504498656
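
If you only need the count and not the data itself, that low-memory pattern packages neatly into a helper; a minimal sketch, with `count_items` as a hypothetical name:

def count_items(iterable):
    """Count the items in an iterable without storing them."""
    count = -1
    for count, _ in enumerate(iterable):
        pass  # exhaust the iterable; enumerate tracks the index
    return count + 1  # -1 + 1 == 0 for an empty iterable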

Cheers!

John Crawford
0

If you don't need the length of the iterator prior to processing the data, you could use a helper function with a future that folds counting into the processing of your iterator/stream:

import asyncio

def ilen(iterable):
    """
    Get a future with the length of an iterator.
    The future will hold the length once the iterator is exhausted.
    @returns: <iterator, count-future>
    """
    def ilen_inner(iterable, future):
        cnt = 0
        for row in iterable:
            cnt += 1
            yield row
        # reached only once the caller has exhausted the wrapped iterator
        future.set_result(cnt)
    cnt_future = asyncio.Future()
    return ilen_inner(iterable, cnt_future), cnt_future

Usage would be:

data = db_connection.execute(query)
data, cnt = ilen(data)
solve_world_hunger(data)
print(f"Processed {cnt.result()} items")
Vajk Hermecz