How is Python's iterator unpacking (star unpacking) implemented (or, what magic methods are involved in unpacking a custom iterator?)

Question

I am writing a class that defines __iter__ and __len__, where the value of __len__ depends on the iterator returned by __iter__. I am getting an interesting RecursionError.

Language versions: Python 3.8.6, 3.7.6. Examples are for illustrating the error only.

In the following example, Iter.__len__() attempts to unpack self, store the result in a list, and then attempts to call the built-in list.__len__() on that list to get the length.

>>> class Iter:
...     def __iter__(self):
...         return range(5).__iter__()
...     def __len__(self):
...         return list.__len__([*self])
...
>>> len(Iter())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in __len__
  File "<stdin>", line 5, in __len__
  File "<stdin>", line 5, in __len__
  [Previous line repeated 993 more times]
  File "<stdin>", line 3, in __iter__
RecursionError: maximum recursion depth exceeded in comparison

However, if I define the class Iter as the following, where Iter.__len__() explicitly unpacks the iterator as returned by Iter.__iter__():

>>> class Iter:
...     def __iter__(self):
...         return range(5).__iter__()
...     def __len__(self):
...         return list.__len__([*self.__iter__()])
...
>>> len(Iter())
5

Then there is no error.

From the traceback, it seems that list.__len__() is trying to call Iter.__len__(), even thought the argument provided is supposedly already a native list object. What is the reason for the RecursionError?

According to schwobaseggl, using set instead of list will not cause a RecursionError:

>>> class Iter:
...     def __iter__(self):
...         return range(5).__iter__()
...     def __len__(self):
...         return set.__len__({*self})
...
>>> len(Iter())
5

For the record. I would make it `len` does NOT have consume the iterator. If you are forced to iterate once, at least use `lru_cache` so you don't need to do it again. Also instead of returning the iter I recommend `yield from` i.e. `yield from range(5)`. — Error - Syntactical Remorse, Dec 23 '20 at 17:23
This is particular about the list `[*self]`. If you use a set in the same way, no error occurs: `{*self}`. Quite interesting. — user2390182, Dec 23 '20 at 17:24
@Error-SyntacticalRemorse The examples are to illustrate the error only. In my actual program, I need live results from the object (backed by data structures that can be modified elsewhere), and performance isn't the main concern, so I suppose evaluating the iterator every time is desired. — tonywu7, Dec 23 '20 at 17:26
Tuples as `(*self,)` will also fail. My best guess is that because of their implementation, `list` and `tuple` try the sequence protocol (`__len__` and `__getitem__`) first which would enable them to allocate space for the underlying array more precisely, and only then go for the iterable protocol. — user2390182, Dec 23 '20 at 17:29
@schwobaseggl Length for lists and tuples is a static value (i.e. not iterated each time). My guess is similar to yours. The list constructor is probably calling `len` of the passed in object which is causing the recursion. Nothing to do with unpacking. Even this following print fails within `__len__`: `print(list(self))` — Error - Syntactical Remorse, Dec 23 '20 at 17:31
@TonyWu I am unsure if you want this (since I think your question is more out of curiosity on why you are getting the error than a better solution), but I recommend implementing `__len__` like this: https://stackoverflow.com/a/390885/8150685 — Error - Syntactical Remorse, Dec 23 '20 at 17:33
@Error-SyntacticalRemorse That's indeed better! Then it doesn't have to store the whole list in the memory — tonywu7, Dec 23 '20 at 17:35
@TonyWu Project specific solution for the link above would be `return sum(1 for _ in self)` in case you wanted it. Or if you are motivated. You can implement the `__bool__` method for the items in the list then just do `sum(self)` :P — Error - Syntactical Remorse, Dec 23 '20 at 17:37
@Error-SyntacticalRemorse Regarding the length as a static value: then my question is, why does `list.__len__`, retrieving the length from a native list object, need to call `Iter.__len__`. If the statement is `[*self.__iter__()]` then by the time `list.__len__` gets the argument, it should already be a constructed list, no? Are there optimizations behind the scene, or what's the precedence here? — tonywu7, Dec 23 '20 at 17:41
@TonyWu `[*self]` calls `len(self)`. And `len(self)` calls `[*self]` again, and so forth... — user2390182, Dec 23 '20 at 17:45
Btw, here is the culprit: https://github.com/python/cpython/blob/master/Objects/listobject.c#L2710 In deed, the list constructor tests whether the passed iterable has a length and tries to use it to allocate memory. Only if that tests fails, it will use extend instead to fill the list. — user2390182, Dec 23 '20 at 17:47
As an aside, don't call `__len__` and `__iter__` directly. Let them be called by `len` and `iter` as intended: `return iter(range(5))` and `return len([*self])`. — chepner, Dec 23 '20 at 17:48
@chepner Yes I am doing that in my code. While I was writing the examples I was trying to be as explicit as possible. I assume that `list.__len__` directly calls the native method, whereas `len()` might try to somehow find `Iter.__len__` which I don't want while debugging. — tonywu7, Dec 23 '20 at 17:53
@schwobaseggl I think this answers the question. Would you mind posting an answer? — tonywu7, Dec 23 '20 at 17:55
@TonyWu btw that `set` logic is a side effect that I would NOT rely on. — Error - Syntactical Remorse, Dec 23 '20 at 18:16

user2390182 · Accepted Answer · 2020-12-24T09:48:43.207

It has little to do with unpacking as such, but with the implementations of different collection types, their constructors in particular.

[*iterable]  # list
(*iterable,) # tuple
{*iterable}  # set

all trigger calls to their classes' respective constructors.

From the current C implementation for list(iterable):

list___init___impl(PyListObject *self, PyObject *iterable) {
    /* ... */
    if (iterable != NULL) {
        if (_PyObject_HasLen(iterable)) {
            Py_ssize_t iter_len = PyObject_Size(iterable);
            if (iter_len == -1) {
                if (!PyErr_ExceptionMatches(PyExc_TypeError)) {
                    return -1;
                }
                PyErr_Clear();
            }
            if (iter_len > 0 && self->ob_item == NULL
                && list_preallocate_exact(self, iter_len)) {
                return -1;
            }
        }
        PyObject *rv = list_extend(self, iterable);
        /* ... */
}

As can be seen (even with such limited C knowledge as mine), the iterable is tested for its size in order to allocate the right amount of memory which is what triggers the calls to __len__ of the passed iterable.

Unsurprisingly, it can be verified that set does no such thing. After all, the relation between the size of the passed iterable and the size of the resulting set is nowhere near as direct as it is for lists or tuples. For instance, think of set([1] * 10**5). It would be foolish to use the size information of the passed list to allocate memory for the set.

On a side note, as has been pointed out in the comments and many other questions/answers on this site (e.g. here):
If you want to determine the length of an iterable, there are more (mainly space-)efficient ways than to collect all items into a Sized collection, e.g.:

def __len__(self):
    return sum(1 for _ in self)

How is Python's iterator unpacking (star unpacking) implemented (or, what magic methods are involved in unpacking a custom iterator?)

1 Answers1