Python generator vs list as array initializer

Question

Here's an example of initializing an array of ten million random numbers, using a list (a), and using tuple-like generator (b). The result is exactly the same, the list or tuple is never used, so there's no practical advantage with one or the other

from random import randint
from array import array

a = array('H', [randint(1, 100) for _ in range(0, 10000000)])
b = array('H', (randint(1, 100) for _ in range(0, 10000000)))

So the question is which one to use. In principle, my understanding is that that a tuple should be able to get away with using less resources than a list, but since this list and tuple are not kept, it should be possible that the code is executed without ever initializing the intermediate data structure… My tests indicate that the list is slightly faster in this case. I can only imagine that this is because the Python implementation has more optimization around lists than tuples. Can I expect this to be consistent?

More generally, should I use one or the other, and why? (Or should I do this kind initialization some other way completely.)

Update: Answers and comments made me realize that the b example is not actually a tuple but a generator, so I edited a bit in the headline and the text above to reflect that. Also I tried splitting the list version into two lines like this, which should force the list to actually be instantiated:

g = [randint(1, 100) for _ in range(0, 10000000)]
a = array('H', g)

It appears to make no difference. The list version takes about 8.5 seconds, and the generator version takes about 9 seconds.

"If given a **list** or **string**, the initializer is passed to the new array’s `fromlist()`, `frombytes()`, or `fromunicode()` method to add initial items to the array. Otherwise, the **iterable** initializer is passed to the `extend()` method." (from [here](https://docs.python.org/3.8/library/array.html#array.array)) — dirkgroten, Feb 09 '20 at 12:28
In generator, the elements are calculated **lazily**. And you can iterate through them **only once**. — Ch3steR, Feb 09 '20 at 12:29
*"it should be possible that the code is executed without ever initializing the intermediate data structure"* - possible in theory, but not done in practice. — kaya3, Feb 09 '20 at 12:29
AFAIK, list comprehension is very efficient in Python. However, if you're mentioning timing results, could you also share how you've measured both timings? The question about which one (list or tuple) you should use is very interesting, but I'm not sure if fits the Q&A form required by Stack Overflow. I think it's hugely opinion-based. — wovano, Feb 09 '20 at 13:33

Grismar · Answer 1 · 2020-02-10T03:16:46.997

2

Although it looks like it, (randint(1, 100) for _ in range(0, 1000000)) is not a tuple, it's a generator:

>>> type((randint(1, 100) for _ in range(0, 1000000)))
<class 'generator'>
>>>

If you really want a tuple, use:

b = array('H', tuple(randint(1, 100) for _ in range(0, 1000000)))

The list being a bit faster than the generator makes sense, since the generator generates the next value when asked, one at a time, while the list comprehension allocates all the memory needed and then proceeds to fill it with values all in one go. That optimisation for speed is paid for in memory space.

I'd favour the generator, since it will work regardless of most reasonable memory restrictions and would work for any number of random numbers, while the speedup of the list is minimal. Unless you need to generate this list again and again, at which time the speedup would start to count - but then you'd probably use the same copy of the list each time to begin with.

edited Feb 10 '20 at 03:16

answered Feb 09 '20 at 12:28

Grismar

27,561
4
31
54

The list version is slightly faster than the generator version, though! – kaya3 Feb 09 '20 at 12:29
The speed difference is probably an implementation detail of `array.append()` vs `array.extend()`. It's hard to generalise why the iterator is slower than the list. – dirkgroten Feb 09 '20 at 12:33
3

@dirkgroten At C-level, a list comprehension is just a loop, while a generator is... a generator. More code is executed every time you call `next()` on a generator. That's why list comprehensions are almost always faster than generator expressions. – iBug Feb 09 '20 at 12:36
1

@iBug good explanation. `fromlist()` indeed just does a list comprehension whereas `extend()` needs to call `next()` on the iterator before calling `append()`. – dirkgroten Feb 09 '20 at 12:40
@iBug So you're saying that the list version doesn't produce a “physical” intermediate data structure containing those numbers either? Your answer is kind of what I expected and consistent with the behavior I see, but then, what about the new version in my edited question, the one with a ```g``` variable? This is as fast as the original list version, and it should force a physical list into existence, no? – njlarsson Feb 09 '20 at 20:26
@njlarsson Yes. The list comprehension calculates all items at once and returns all of them, and generally takes more memory than the generator expression. It's a memory-time tradeoff. The new version involving a `g` variables makes hardly any difference with the `a` one in your original code, since both list comprehensions are "expanded" at once, except that you may reuse the list later as it's assigned to `g`. – iBug Feb 10 '20 at 02:58

score 1 · Accepted Answer · answered Feb 10 '20 at 03:18

[randint(1, 100) for _ in range(0, 10000000)]

This is a list comprehension. Every element is evaluated in a tight loop and put together into a list, so it is generally faster but takes more RAM (everything comes out at once).

(randint(1, 100) for _ in range(0, 10000000))

This is a generator expression. No element is evaluated at this point, and one of them comes out at a time when you call next() on the resulting generator. It's slower but takes a consistent (small) amount of memory.

As given in the other answer, if you want a tuple, you should convert either into one:

tuple([randint(1, 100) for _ in range(0, 10000000)])
tuple(randint(1, 100) for _ in range(0, 10000000))

Let's come back to your question:

When to use which?

In general, if you use a list comprehension or generator expression as an initializer of another sequential data structure (list, array, etc.), it makes no difference except for the memory-time tradeoff mentioned above. Things you need to consider is as simple as performance and memory budget. You would prefer the list comprehension if you need more speed (or write a C program to be absolutely fast) or the generator expression if you need to keep the memory consumption low.

If you plan to reuse the resulting sequence, things start to get interesting.

A list is strictly a list, and can for all purposes be used as a list:

a = [i for i in range(5)]
a[3]  # 3
a.append(5)            # a = [0, 1, 2, 3, 4, 5]
for _ in a:
    print("Hello")
                       # Prints 6 lines in total
for _ in a:
    print("Bye")
                       # Prints another 6 lines
b = list(reversed(a))  # b = [5, 4, 3, 2, 1, 0]

A generator can be only used once.

a = (i for i in range(5))
a[3]                   # TypeError: generator object isn't subscriptable
a.append(5)            # AttributeError: generator has no attribute 'append'
for _ in a:
    print("Hello")
                       # Prints 5 lines in total
for _ in a:
    print("Bye")
                       # Nothing this time, because
                       # the generator has already been consumed
b = list(reversed(a))  # TypeError: generator isn't reversible

The final answer is: Know what you want to do, and find the appropriate data structure for it.

A clear case for the generator in my case then, because the reason for using an ```array('H')``` in the first place is to be conservative with space. What bugs me, though, is that if ```next()``` is the generator's sole interface, there's no way to know the final size of the array until the generator is finished. This should mean that the array size is doubled O(log *n*) times when building an size *n* array, potentially using O(*n*) extra time and auxiliary space. It would have been nice if there was a way to predefine the size of an array… — njlarsson, Feb 10 '20 at 07:28
@njlarsson Unfortunately, that seems to be the case. Generators only have `next()` so you'd better off figuring out the total size beforehand. — iBug, Feb 10 '20 at 07:48

Python generator vs list as array initializer

2 Answers2

When to use which?