99

Suppose I have an arbitrary iterable - for example, a generator that iterates over lines of a file and yields the ones matching a regex.

How can I count the number of items in that iterable, supposing that I don't care about the elements themselves?
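For concreteness, the kind of generator described above might look like this (a hypothetical sketch; the function name and sample data are illustrative, and it accepts any iterable of strings, including an open file object):

```python
import re

def matching_lines(lines, pattern):
    """Yield only the lines that match the regex *pattern*."""
    regex = re.compile(pattern)
    return (line for line in lines if regex.search(line))

sample = ["ERROR: disk full", "INFO: ok", "ERROR: timeout"]
print(list(matching_lines(sample, r"^ERROR")))
# ['ERROR: disk full', 'ERROR: timeout']
```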

Karl Knechtel
Fred Foo
  • 8
    Please don't use `_` as a variable name, because (1) it tends to confuse people, making them think this is some kind of special syntax, (2) collides with `_` in the interactive interpreter and (3) collides with the common gettext alias. – Sven Marnach Mar 21 '11 at 22:39
  • 9
    @Sven: I use `_` all the time for unused variables (a habit from Prolog and Haskell programming). (1) is a reason for asking this in the first place. I didn't consider (2) and (3), thanks for pointing them out! – Fred Foo Mar 21 '11 at 22:47
  • 2
    duplicated: http://stackoverflow.com/questions/390852/is-there-any-built-in-way-to-get-the-length-of-an-iterable-in-python – tokland Mar 21 '11 at 22:47
  • `python 3.x`, if there exits repeated items and you also want to check the count for each item, use `Counter(generator/iterator)`, eg., `c = Counter(iter('goodbadugly'))`, then count the total: `sum(c.values())` – Kuo Dec 14 '20 at 11:16
  • 1
    @SvenMarnach: Using `_` inside a function, especially inside a genexpr, won't collide with the interactive interpreter (in Py2, using it inside a listcomp at global scope *would* mess with the interactive interpreter's use of `_`, but that was fixed in Py3, where listcomps run in a separate scope). If your function is also using the gettext alias, then yeah, that's a problem, but otherwise, in non-interactive interpreter code, `_` is an accepted way to say "I don't care about the value here", to the point that linters that check for assigned unread names will accept it specifically. – ShadowRanger Sep 30 '21 at 22:36
  • @ShadowRanger My main argument against it is the first one – people _still_ think the underscore has a special meaning, throwing away the result instead of holding on to it, but it doesn't – it's just a regular variable name. And if I have the choice between writing code everyone understands immediately, and code some people have misconceptions about, all else being equal I'll pick the former. However, I've kind of given up this particular fight – it has just become too common. – Sven Marnach Oct 01 '21 at 07:39

8 Answers

193

Calls to itertools.imap() in Python 2 or map() in Python 3 can be replaced by equivalent generator expressions:

sum(1 for dummy in it)

This also uses a lazy generator, so it avoids materializing a full list of all iterator elements in memory.
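For example, counting the even numbers below one million without ever holding them all in memory at once (an illustrative generator standing in for the file-filtering one):

```python
# A generator over a large range: count matches without building a list
evens = (n for n in range(1_000_000) if n % 2 == 0)
print(sum(1 for dummy in evens))  # 500000
```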

Sven Marnach
  • 4
    you can use `len(list(it))` -- or if the elements are unique, then `len(set(it))` to save a character. – F1Rumors Nov 14 '16 at 05:08
  • 43
    @F1Rumors Using `len(list(it))` is fine in most cases. However, when you have a lazy iterator yielding lots and lots of elements, you don't want to store all of them in memory at the same time just to count them, which is avoided using the code in this answer. – Sven Marnach Nov 14 '16 at 11:11
  • agreed: as an answer, it was predicated on "shortest code" being more important than "lowest memory". – F1Rumors Nov 14 '16 at 15:13
  • 7
    As it has been suggested in this [thread](https://stackoverflow.com/a/7460986/4767645), `sum(1 for _ in generator)` avoids filling the memory. – Sylvain Jun 14 '20 at 20:59
47

A method that's meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when it's short), while maintaining fixed memory overhead (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs:

# On Python 2 only, get zip that lazily generates results instead of returning list
from future_builtins import zip

from collections import deque
from itertools import count

# Avoid constructing a deque each time, reduces fixed overhead enough
# that this beats the sum solution for all but length 0-1 inputs
consumeall = deque(maxlen=0).extend

def ilen(it):
    # Make a stateful counting iterator
    cnt = count()
    # zip it with the input iterator, then drain until input exhausted at C level
    consumeall(zip(it, cnt)) # cnt must be second zip arg to avoid advancing too far
    # Since count 0 based, the next value is the count
    return next(cnt)

Like len(list(it)) it performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding byte code execution per loop is usually the key to performance in CPython.

It's surprisingly difficult to come up with fair test cases for comparing performance. list cheats by using __length_hint__, which isn't likely to be available for arbitrary input iterables, and itertools functions that don't provide __length_hint__ often have special operating modes that work faster when the value returned on each loop is released/freed before the next value is requested (which deque with maxlen=0 will do). The test case I used was a generator function that takes an input and returns a C-level generator lacking the special itertools return-container optimizations and __length_hint__, using Python 3.3+'s yield from:

def no_opt_iter(it):
    yield from it

Then, using IPython's %%timeit cell magic (substituting different constants for 100):

>>> %%timeit fakeinput = (0,) * 100
... ilen(no_opt_iter(fakeinput))

When the input isn't large enough that len(list(it)) would cause memory issues, on a Linux box running Python 3.9 x64, my solution takes about 50% longer than def ilen(it): return len(list(it)), regardless of input length.

For the smallest inputs, the setup cost of loading/calling consumeall/zip/count/next means this takes marginally longer than def ilen(it): return sum(1 for _ in it) (about 40 ns more on my machine for a length-0 input, a 10% increase over the simple sum approach). But by length-2 inputs the cost is equivalent, and somewhere around length 30 the initial overhead is unnoticeable compared to the real work; the sum approach takes roughly 50% longer.

Basically, if memory use matters or inputs don't have bounded size and you care about speed more than brevity, use this solution. If inputs are bounded and smallish, len(list(it)) is probably best, and if they're unbounded, but simplicity/brevity counts, you'd use sum(1 for _ in it).
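As a quick sanity check (helper names here are illustrative; this compares correctness only, not speed, on Python 3), the three approaches discussed in this answer produce identical counts:

```python
from collections import deque
from itertools import count

consumeall = deque(maxlen=0).extend

def ilen_deque(it):
    cnt = count()
    consumeall(zip(it, cnt))  # cnt second so it isn't advanced past the last item
    return next(cnt)

def ilen_list(it):
    return len(list(it))

def ilen_sum(it):
    return sum(1 for dummy in it)

# All three agree on an arbitrary generator
results = [f(x for x in range(1000)) for f in (ilen_deque, ilen_list, ilen_sum)]
print(results)  # [1000, 1000, 1000]
```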

ShadowRanger
  • 1
    This is exactly the implementation in `more_itertools.ilen`. – rsalmei Jul 11 '19 at 01:01
  • 5
    @rsalmei: Looks like they [adopted my implementation eight months ago](https://github.com/erikrose/more-itertools/commit/5161c3455375492ce9dfb4ad32a2e5ee1506f966). Technically, it's slightly slower (because they passed `maxlen` by keyword, not positionally), but that's fixed overhead, not meaningful on big-O runtime. Either way, they copied me (I posted this 3.5 years ago), not the other way around. :-) – ShadowRanger Jul 11 '19 at 01:07
  • Nice solution. As an observation--if it is "surprisingly difficult to come up with fair test cases for comparing performance," then perhaps there is no general solution of worth and it would be best to time the different implementations (this one, `sum(1 ..)`, `len(list())`, etc.) to one's particular situation. – user650654 Mar 07 '20 at 20:58
  • @user650654: Some of the difficulty is in the fact that, in a test case, you need to run it many times, without paying the cost of recreating an iterator over and over (which would hide performance differences). In the real world, you're not concerned with making fake inputs cheaply; you have the input, you need to count it once, then you're done (and there are lots of things that would behave like my test case input, they're just expensive to recreate). That said, I agree specific situations call for different approaches; that's what my final paragraph was all about. – ShadowRanger Jun 25 '21 at 15:35
  • @ShadowRanger I very much doubt they were copying you, more likely they are copying the [itertools consume recipe](https://docs.python.org/3/library/itertools.html#itertools-recipes) which has been there [since 2009](https://github.com/python/cpython/commit/fa007965c8cf15a45bb38b3b7432c6df1949c43f). – wim Nov 28 '22 at 06:05
  • @wim: The `consume` recipe is not the copied part of it (I actually tweaked how I implemented that subset of `consume` in a later edit, but it's hardly original). It's the use of `zip` and `itertools.count` with `consume` in a way that avoids storing the results of `zip` (thereby enabling the `zip` optimization that reuses the `tuple` instead of allocating new ones) that got copied; the implementation proposed in [`more-itertools` #230](https://github.com/more-itertools/more-itertools/issues/230) is a near-perfect copy of the code here at the time it was opened. – ShadowRanger Nov 28 '22 at 16:26
  • To be clear, I don't really care if they copied me (there was a smiley there to indicate levity at the final statement), just pointing out that, at the time I posted, and for 3.5 years thereafter, `more_itertools.ilen` used slower code, then adopted, near verbatim, the version of the code I posted here at that time. – ShadowRanger Nov 28 '22 at 16:28
9

A short way is:

def ilen(it):
    return len(list(it))

Note that if you are generating a lot of elements (say, tens of thousands or more), then putting them all in a list may become a performance issue. However, this is a simple expression of the idea, and in most cases the performance won't matter.

Greg Hewgill
  • 1
    I'd thought of this, but performance does matter as I often process large text files. – Fred Foo Mar 21 '11 at 22:57
  • 9
    As long as you don't run out of memory, this solution is actually quite good performance-wise, since this will do the loop in pure C code -- all the objects have to be generated anyway. Even for big iterators this is faster than `sum(1 for i in it)` as long as as everything fits into memory. – Sven Marnach Mar 21 '11 at 23:18
  • 1
    It's actually crazy, that `len(it)` doesn't work. `sum(it)`, `max(it)`, `min(it)` and so on work as expected, only `len(it)` doesn't. – Kai Petzke Jul 21 '18 at 00:59
  • 5
    @KaiPetzke: When `it` is an iterator, there is no guarantee it knows its own length without running it out. The most obvious example being file objects; they have a length based on the number of lines in the file, but lines are variable length, and the only way to know how many lines there are is to read the whole file and count newlines. `len()` is intended to be a cheap `O(1)` operation; do you want it silently reading through multi-GB files when you ask for their length? `sum`, `max` and `min` are aggregation functions that must read their data, `len` isn't. – ShadowRanger Nov 08 '18 at 15:16
  • @ShadowRanger: An option might be to add an O(n) aggregate `count(it)`. – Kai Petzke Dec 14 '18 at 09:31
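To illustrate the point made in the comments above: an iterator doesn't support len() and must be consumed to be counted (a minimal demonstration):

```python
it = iter([1, 2, 3])
try:
    len(it)  # iterators have no __len__
except TypeError:
    pass  # len() refuses rather than silently consuming the iterator
print(sum(1 for dummy in it))  # 3 -- counting consumes the iterator
```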
7

more_itertools is a third-party library that implements an ilen tool:

pip install more_itertools

import more_itertools as mit


mit.ilen(x for x in range(10))
# 10
pylang
4
len(list(it))

Note that it will hang if it's an infinite generator.
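If an infinite (or unexpectedly huge) generator is a possibility, one defensive option (an illustrative cap, assuming an upper bound is acceptable) is to bound consumption with itertools.islice:

```python
from itertools import count, islice

# count() yields 0, 1, 2, ... forever; islice caps how many items we consume
print(len(list(islice(count(), 1000))))  # 1000
```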

Nikhil CSB
2

I like the cardinality package for this; it is very lightweight, and it tries to use the fastest possible implementation available depending on the iterable.

Usage:

>>> import cardinality
>>> cardinality.count([1, 2, 3])
3
>>> cardinality.count(i for i in range(500))
500
>>> def gen():
...     yield 'hello'
...     yield 'world'
>>> cardinality.count(gen())
2
Erwin Mayer
2

These would be my choices, one or the other:

print(len([*gen]))
print(len(list(gen)))
prosti
  • 1
    There seems to be little point in the first option, as it would just add the overhead of expanding the entire generator before converting it to a `list`. Meaning this answer adds nothing of value over other answers, unless you can explain *why* the first option has any merit. – jpmc26 Sep 23 '19 at 15:19
  • @jpmc26, the OP asked for the shortest way to count the number of elements in the generator. `len([*gen])` is pretty short. This would be valuable in Code Golf, for instance. However, I agree with you that in most use cases this solution is suboptimal. – ruancomelli Apr 28 '20 at 21:17
  • Actually, "the shortest way" is written in the title, but the question body is rather different. `len([*gen])` feels unpythonic to me. – ruancomelli Apr 28 '20 at 21:19
  • @jpmc26 Python doesn't really have "conversions"; *creating* a list from a generator - by passing it to `list` - *does the same work* that the `[*gen]` trick does. For example, on my machine `python -m timeit "len([*(_ for _ in range(100))])"` shows slightly better performance than `python -m timeit "len(list(_ for _ in range(100)))"` (presumably since the name `list` doesn't need to be looked up in the global namespace). – Karl Knechtel Apr 28 '23 at 19:26
0

In case you want to use the iterable elsewhere and know how many elements were consumed, you can create a simple wrapper class:

from collections.abc import Iterable, Iterator
from typing import Generic, TypeVar

_T = TypeVar("_T")


class IterCounter(Generic[_T]):
    """Iterator that keeps count of the consumed elements"""

    def __init__(self, iterable: Iterable[_T]) -> None:
        self._iterator = iter(iterable)
        self.count = 0

    def __iter__(self) -> Iterator[_T]:
        return self

    def __next__(self) -> _T:
        element = next(self._iterator)
        self.count += 1
        return element


counter = IterCounter(range(5))

print(counter.count)  # 0

print(list(counter))  # [0, 1, 2, 3, 4]

print(counter.count)  # 5
Alexander