Understanding len function with iterators

Question

Reading the documentation I have noticed that the built-in function len doesn't support all iterables but just sequences and mappings (and sets). Before reading that, I always thought that the len function used the iteration protocol to evaluate the length of an object, so I was really surprised reading that.

I read the already-posted questions (here and here) but I am still confused, I'm still not getting the real reason why not allow len to work with all iterables in general.

Is it a more conceptual/logical reason than an implementational one? I mean when I'm asking the length of an object, I'm asking for one property (how many elements it has), a property that objects as generators don't have because they do not have elements inside, the produce elements.

Furthermore generator objects can yield infinite elements bring to an undefined length, something that can not happen with other objects as lists, tuples, dicts, etc...

So am I right, or are there more insights/something more that I'm not considering?

I don't think you're going to get a better answer than what you've already seen. — user2357112, Sep 08 '14 at 15:51
This puzzles me as well. Why does python support `sum` or `all` on generators, but not `len`? It's basically the same kind of thing. There must be an explanation somewhere in PEPs or mailing lists... — georg, Sep 08 '14 at 15:55
What would you use `len` for on an iterator? The only way to find out how long it is is to iterate over it, so once you've found out how long it is (assuming it isn't infinitely long) you've already consumed it and can't use it for anything else. — jonrsharpe, Sep 08 '14 at 15:58
On second thought, I guess [this](http://stackoverflow.com/a/11463097/989121) answers the question: because you cannot _define_ what `len(g)` is. "How many times `g` is going to loop?" - the answer to this is generally "it depends". If you agree, I think we can dupe-close this. — georg, Sep 08 '14 at 16:01
I don't know the original reason. For me, I expect `len()` to be O(1), while `sum()` or `all()` are clearly O(n). — Robᵩ, Sep 08 '14 at 16:10
@georg That doesn't address the use of `sum` or `all`, which are just more generalized version of the same loop requirement. That is: if `sum` is allowed to hang because you called it on an infinite generator, why cannot `len` hang when called the same way? — ely, Sep 08 '14 at 16:10
@Robᵩ For dynamically resizable array objects that do not store a member variable for the length, computing length has to be at least O(n). Only for data structures that pay some O(n) cost to pre-calculate the value and store it as a member variable can you (later on, after that pre-calculation) get an O(1) length. This is different from many compiled languages where array lengths must be declared explicitly and the compiler can easily provide it as a member variable. For generators, this is not possible since the "length" is a mathematical property of the definition and not known in advance. — ely, Sep 08 '14 at 16:28
In my experience, it's much more common for length to be an O(n) operation in practical settings (streaming data, dynamically read data from files or web services) than it is to be O(1) (needing the length of a user-defined array with a fixed, known-in-advance size). — ely, Sep 08 '14 at 16:28
@EMS - Your generalizations may be true. In **Python**, I expect `len()` to always be `O(1)` and non-destructive. Thus, I do not expect it be valid against iterators. — Robᵩ, Sep 08 '14 at 16:29
@Robᵩ Sorry what I mean was specifically in Python, with the `__len__` protocol, it is *not* O(1). This is in contrast to something like Java, where the array size is declared in advance, so "computing the length" is the same as "just looking up the value of a member attribute that stores the length", which is O(1). This is rarely what happens in Python, since the data structures don't keep those member variables. `list`, `set`, `tuple`, and `dict` might, I don't know, but I'd be surprised if `set`, `list` and `dict` did since their lengths can change. But most container objects *do not*. — ely, Sep 08 '14 at 16:32
Just to clarify: I'm saying that in a dynamically typed language like Python, the fact that some of the language's builtin container types provide O(1) length look-up (via O(n) pre-calculations when the size changes) is just a convention. It's a convention of those builtin types and not something that is common on a wide range of container types, and probably not at all a good thing to expect. Contrasted with a statically typed language, where it is much more reasonable that container lengths must be declared ahead of time and are always reasonably known in advance. — ely, Sep 08 '14 at 16:38
It's exactly things like iterator protocol that highlight the importance of this in Python. My experience is that *most* containers used in real code are custom classes, third-party containers, or generators. Obviously there is heavy use of builtin containers, but it's not so overwhelmingly dominant that their conventional O(1) length should be considered "expected" of the length operation in Python. — ely, Sep 08 '14 at 16:39
@EMS: Without a member holding the length of a container, the only way to know where the end is would be a sentinel value. Sentinel values are a much bigger headache than a length member. Imagine if appending to a list required you to iterate through the whole thing just to find the end of it. Also, maintaining the length isn't O(n); it's O(1) per potentially-size-changing operation, and it saves more time than it takes. — user2357112, Sep 08 '14 at 17:12
I'm not sure the point of your comment. These are known properties of array types that use a member to hold the length. It's still the case that many don't (e.g. generators in Python) and that it's prudent not to expect it to be O(1) despite the common builtin containers choosing to implement it that way. — ely, Sep 08 '14 at 17:17
Generally, `__len__` is expected to be O(1), and anything with a non-O(1) `__len__` is expected to document this property. — user2357112, Sep 08 '14 at 17:19
Okay, no more c-word. What data types with non-O(1) `__len__` can you name? Do any of them not document this property? — user2357112, Sep 08 '14 at 17:21

score 10 · Accepted Answer · answered Sep 08 '14 at 19:19

The biggest reason is that it reduces type safety.

How many programs have you written where you actually needed to consume an iterable just to know how many elements it had, throwing away anything else?

I, in quite a few years of coding in Python, never needed that. It's a non-sensical operation in normal programs. An iterator may not have a length (e.g. infinite iterators or generators that expects inputs via send()), so asking for it doesn't make much sense. The fact that len(an_iterator) produces an error means that you can find bugs in your code. You can see that in a certain part of the program you are calling len on the wrong thing, or maybe your function actually needs a sequence instead of an iterator as you expected.

Removing such errors would create a new class of bugs where people, calling len, erroneously consume an iterator, or use an iterator as if it were a sequence without realizing.

If you really need to know the length of an iterator, what's wrong with len(list(iterator))? The extra 6 characters? It's trivial to write your own version that works for iterators, but, as I said, 99% of the time this simply means that something with your code is wrong, because such an operation doesn't make much sense.

The second reason is that, with that change, you are violating two nice properties of len that currently hold for all (known) containers:

It is known to be cheap on all containers ever implemented in Python (all built-ins, standard library, numpy & scipy and all other big third party libraries do this on both dynamically sized and static sized containers). So when you see len(something) you know that the len call is cheap. Making it work with iterators would mean that suddenly all programs might become inefficient due to computations of the length.

Also note that you can, trivially, implement O(1) __len__ on every container. The cost to pre-compute the length is often negligible, and generally worth paying. The only exception would be if you implement immutable containers that have part of their internal representation shared with other instances (to save memory). However, I don't know of any implementation that does this, and most of the time you can achieve better than O(n) time anyway.

In summary: currently everybody implements __len__ in O(1) and it's easy to continue to do so. So there is an expectation for calls to len to be O(1). Even if it's not part of the standard. Python developers intentionally avoid C/C++'s style legalese in their documentation and trust the users. In this case, if your __len__ isn't O(1), it's expected that you document that.
It is known to be not destructive. Any sensible implementation of __len__ doesn't change its argument. So you can be sure that len(x) == len(x), or that n = len(x);len(list(x)) == n.

Even this property is not defined in the documentation, however it's expected by everyone, and currently, nobody violates it.

Such properties are good, because you can reason and make assumptions about code using them. They can help you ensure the correctness of a piece of code, or understand its asymptotic complexity. The change you propose would make it much harder to look at some code and understand whether it's correct or what would be it's complexity, because you have to keep in mind the special cases.

In summary, the change you are proposing has one, really small, pro: saving few characters in very particular situations, but it has several, big, disadvantages which would impact a huge portion of existing code.

An other minor reason. If len consumes iterators I'm sure that some people would start to abuse this for its side-effects (replacing the already ugly use of map or list-comprehensions). Suddenly people can write code like:

len(print(something) for ... in ...)

to print text, which is really just ugly. It doesn't read well. Stateful code should be relagated to statements, since they provide a visual cue of side-effects.

Understanding len function with iterators

1 Answers1

Linked