
I tested `list(x for x in a)` with three different CPython versions. On `a = [0]` it's significantly faster than on `a = []`:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

465 ns  412 ns     543 ns  515 ns     513 ns  457 ns   
450 ns  406 ns     544 ns  515 ns     506 ns  491 ns   
456 ns  408 ns     551 ns  513 ns     515 ns  487 ns   
455 ns  413 ns     548 ns  516 ns     513 ns  491 ns   
452 ns  404 ns     549 ns  511 ns     508 ns  486 ns   

With `tuple` instead of `list`, it's the other way around, as expected:

 3.9.0 64-bit       3.9.0 32-bit       3.7.8 64-bit
a = []  a = [0]    a = []  a = [0]    a = []  a = [0]

354 ns  405 ns     467 ns  514 ns     421 ns  465 ns   
364 ns  407 ns     467 ns  527 ns     425 ns  464 ns   
353 ns  399 ns     490 ns  549 ns     419 ns  465 ns   
352 ns  400 ns     500 ns  556 ns     414 ns  474 ns   
354 ns  405 ns     494 ns  560 ns     420 ns  474 ns   

So why is `list` faster when it (and the underlying generator iterator) has to do more?

Tested on Windows 10 Pro 2004 64-bit.

Benchmark code:

from timeit import repeat

setups = 'a = []', 'a = [0]'
number = 10**6

print(*setups, sep='   ')
for _ in range(5):
    for setup in setups:
        t = min(repeat('list(x for x in a)', setup, number=number)) / number
        print('%d ns' % (t * 1e9), end='   ')
    print()
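
The tuple measurements use the same script with the timed statement swapped accordingly (this line is a reconstruction, not shown in the original script):

t = min(repeat('tuple(x for x in a)', setup, number=number)) / number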

Byte sizes, showing that it doesn't over-allocate for input `[]` but does for input `[0]`:

>>> [].__sizeof__()
40
>>> list(x for x in []).__sizeof__()
40

>>> [0].__sizeof__()
48
>>> list(x for x in [0]).__sizeof__()
72
Kelly Bundy
  • I wonder if it is related to pre-allocation. IIRC, the empty list produces a list object with a default amount of space allocated for future elements. With the non-empty list, `list` will *only* allocate the amount necessary to hold the elements from the iterator. Tuples, since they cannot grow, *always* allocate only enough space to hold what comes from the iterator. – chepner Oct 19 '20 at 13:27
  • @chepner Which list do you mean? The original `a` or the created shallow copy? `a` is created during setup, and neither `list(...)` nor `tuple(...)` can know the size of the result in advance. Just tried, `(x for x in []).__length_hint__()` gives an error (unlike `iter([]).__length_hint__()`, which returns `0`). – Kelly Bundy Oct 19 '20 at 13:32
  • @chepner Another test: `list(x for x in []).__sizeof__()` says 40 bytes while `list(x for x in [0]).__sizeof__()` says 72 bytes. – Kelly Bundy Oct 19 '20 at 13:41
  • I cannot reproduce your results (Python 3.7 64 bit/Win 10 Pro). Sometimes `[]` is faster, sometimes `[0]` – Ocaso Protal Oct 20 '20 at 07:39
  • @OcasoProtal weird, I see `[]` taking ~2x the time for `[0]` pretty consistently on Python 3.7 64bit Win10 Pro – Pranav Hosangadi Oct 22 '20 at 17:25
  • @HeapOverflow I cannot reproduce your results either; for reference, I obtained: `[]: 463 ns ± 2.69 ns per loop (mean ± std. dev. of 15 runs, 10000000 loops each)` and `[0]: 471 ns ± 3.4 ns per loop (mean ± std. dev. of 15 runs, 10000000 loops each)`. You should also check std.dev. to make sure your results don't vary too much. Also please provide details on how you obtained these distributions of Python. – a_guest Oct 22 '20 at 21:22
  • @a_guest You mean the distributions I show in the question? They're produced by the script shown in the question. And for each scenario I show five times, each of which is from five repetitions of a million repetitions. And I ran the script multiple times and always got similar results. So that's multiple multiple multiple multiple executions. Always showing small deviations, like, in the first block, five times from 450 to 465 ns vs five times from 404 to 413 ns. That's pretty clear, isn't it? Does it really additionally need a std.dev to be convincing? How did you obtain your ... – Kelly Bundy Oct 23 '20 at 02:51
  • ... results, btw? Maybe I should switch to that format, i.e., instead of five times just show one plus std dev. On the other hand, I find five times more understandable than a std dev, plus since I do it in round-robin fashion (i.e., alternate between [] and [0] for all ten measurements), I'm more confident that I have stable CPU speeds. If I do it like I suspect you did, first measure [] and then [0], maybe one appears slower than the other just because the CPU was slower or the program got less CPU share during the run. So I'd have to show multiple results anyway. – Kelly Bundy Oct 23 '20 at 02:59
  • @HeapOverflow I mean the result returned by `repeat` of which you take the `min`. Anyway it's of course better to perform multiple runs since with multiple results one can perform a proper statistical test whether the two run-times are actually different. For reference, I did the test from within ipython via [`%timeit`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit). – a_guest Oct 23 '20 at 07:31

1 Answer


What you observe is that pymalloc (Python's memory manager) is faster than the memory manager provided by your C runtime.

It is easy to see in the profiler that the main difference between the two versions is that `list_resize` and `_PyObject_Realloc` need more time in the `a=[]` case. But why?

When a new list is created from an iterable, the list tries to get a hint of how many elements are in the iterator:

n = PyObject_LengthHint(iterable, 8);

However, this doesn't work for generators, so the hint is the default value 8.
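
A quick way to see this from Python is `operator.length_hint`, which mirrors the C-level `PyObject_LengthHint`:

import operator

# A list iterator knows how many elements remain; a generator doesn't,
# so the supplied default (8, as in the C code above) is returned.
print(operator.length_hint(iter([0]), 8))         # 1
print(operator.length_hint((x for x in [0]), 8))  # 8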

After the iterator is exhausted, the list tries to shrink, because it holds only 0 or 1 elements (and not the full capacity that was allocated due to the too-large size hint). For 1 element this leads (due to over-allocation) to a capacity of 4 elements. The case of 0 elements, however, gets special handling: it will not be over-allocated:

// Objects/listobject.c, list_resize(), shrinking after the iterator is exhausted:
// ...
if (newsize == 0)
    new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...
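
The `__sizeof__` numbers from the question match this on a 64-bit build, where the list header takes 40 bytes and each slot is an 8-byte pointer:

print(list(x for x in []).__sizeof__())   # 40 = 40 + 0*8, capacity 0
print(list(x for x in [0]).__sizeof__())  # 72 = 40 + 4*8, capacity 4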

So in the "empty" case, `PyMem_Realloc` will be asked for 0 bytes. This call is passed via `_PyObject_Malloc` down to `pymalloc_alloc`, which in the case of 0 bytes returns NULL:

if (UNLIKELY(nbytes == 0)) {
   return NULL;
}

However, `_PyObject_Malloc` falls back to the "raw" malloc if pymalloc returns NULL:

static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
    void* ptr = pymalloc_alloc(ctx, nbytes);
    if (LIKELY(ptr != NULL)) {
        return ptr;
    }

    ptr = PyMem_RawMalloc(nbytes);
    if (ptr != NULL) {
        raw_allocated_blocks++;
    }
    return ptr;
}

The raw allocator in turn turns the 0-byte request into a 1-byte request, as can easily be seen in the definition of `_PyMem_RawMalloc`:

static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
    /* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
       for malloc(0), which would be treated as an error. Some platforms would
       return a pointer with no memory behind it, which would break pymalloc.
       To solve these problems, allocate an extra byte. */
    if (size == 0)
        size = 1;
    return malloc(size);
}

Thus, the case `a=[0]` will use pymalloc, while `a=[]` will use the memory manager of the underlying C runtime, which explains the observed difference.


Now, all of this can be seen as a missed optimization, because for `newsize = 0` we could just set `ob_item` to NULL, adjust the other members, and return.

Let's try it out:

static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    // ...
    if (newsize == 0) {
        PyMem_Del(self->ob_item);
        self->ob_item = NULL;
        Py_SIZE(self) = 0;
        self->allocated = 0;
        return 0;
    }
    // ...
}

With this fix, the empty case becomes slightly faster (by about 10%) than the `a=[0]` case, as expected.


My claim that pymalloc is faster for small sizes than the C-runtime memory manager can easily be tested with `bytes`: if more than 512 bytes need to be allocated, pymalloc falls back to plain malloc:

print(bytes(479).__sizeof__())   #  512
%timeit bytes(479)               # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__())   #  513
%timeit bytes(480)               # 296 ns ± 24.8 ns

The actual difference is larger than the 50% shown (this jump cannot be explained by the one-byte change in size alone), as at least part of the time is spent on the initialization of the bytes object and so on.
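
The threshold sits at 479 because a bytes object carries 33 bytes of overhead on a 64-bit build, so 479 payload bytes land exactly on pymalloc's 512-byte limit (SMALL_REQUEST_THRESHOLD in obmalloc.c):

print(b''.__sizeof__())   # 33
print(479 + 33)           # 512 -> still served by pymalloc
print(480 + 33)           # 513 -> falls back to raw malloc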

Here is a more direct comparison with the help of Cython:

%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del

def with_pymalloc(int size):
    # allocate and immediately free `size` bytes, 1000 times, via pymalloc
    cdef int i
    for i in range(1000):
        PyMem_Del(PyMem_Malloc(size))

def with_cmalloc(int size):
    # the same, but via the C runtime's malloc/free
    cdef int i
    for i in range(1000):
        free(malloc(size))

and now

%timeit with_pymalloc(1)   #  15.8 µs ± 566 ns
%timeit with_cmalloc(1)    #  51.9 µs ± 2.17 µs

I.e. pymalloc is about 3 times faster (or about 35 ns per allocation). Note: some compilers would optimize `free(malloc(size))` out, but MSVC doesn't.

As another example: some time ago I replaced the default allocator with pymalloc for a C++ `std::map`, which led to a speedup by a factor of 4.


For profiling the following script was used:

a=[0] # or a=[]
for _ in range(10000000):
    list(x for x in a)

together with Visual Studio's built-in performance profiler in Release mode.

The `a=[0]` version needed 6.6 seconds (in the profiler) while the `a=[]` version needed 6.9 seconds (i.e. about 5% more). After the "fix", `a=[]` needed only 5.8 seconds.

The share of time spent in `list_resize` and `_PyObject_Realloc`:

                     a=[0]          a=[]       a=[], fixed        
list_resize           3.5%          10.2%          3%
_PyObject_Realloc     3.2%           9.3%          1%

Obviously, there is variance from run to run, but the differences in running times are significant and can explain the lion's share of the observed time difference.

Note: the difference of 0.3 seconds for 10^7 allocations is about 30 ns per allocation, a number similar to the one we got above for the difference between pymalloc's and the C runtime's allocations.


When verifying the above with a debugger, one must be aware that in debug mode Python uses a debug version of pymalloc, which appends additional data to the requested memory. Thus pymalloc will never be asked to allocate 0 bytes in a debug build, but 0 bytes + debug overhead, so there will be no fallback to malloc. One should therefore either profile in Release mode or switch to release-pymalloc in the debug build (there is probably an option for it; I just don't know it, the relevant parts in the code are here and here).
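
A likely candidate for that option (an assumption on my part, not verified against this exact debug build) is the PYTHONMALLOC environment variable, available since Python 3.6, which selects the allocator at startup:

PYTHONMALLOC=pymalloc python script.py   # release pymalloc, without the debug hooks
PYTHONMALLOC=malloc python script.py     # C-runtime allocator everywhere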

ead
  • @ead Well done, do you also know why `pymalloc` is faster? I'm reading the [source code](https://github.com/python/cpython/blob/d36cf5f1d20ce9f111a8fc997104785086e8eee6/Objects/obmalloc.c#L1612) but I'm not sure I understand it correctly. Is it that `pymalloc` just reuses the exact same memory for each of the fresh `lists`s that are created during each iteration of the performance test loop? So would this difference in performance vanish if only a single iteration was run (since then there's no memory to be reused)? – a_guest Oct 23 '20 at 10:27
  • @a_guest, I don't know why pymalloc is faster (I really don't know how heap-memory works on Windows and not much about glibc's implementation). One thing is probably, that pymalloc doesn't work with multiple threads (thank you, GIL), so it doesn't have to lock anything. Yet this is only a speculation. But it doesn't surprise me: it would be strange to write pymalloc, if it would not outperform the underlying memory manager. – ead Oct 23 '20 at 10:54
  • I think `Py_SIZE(self) = 0;` should be `Py_SET_SIZE(self, 0);`, as for example [here](https://github.com/python/cpython/blob/v3.9.0/Objects/listobject.c#L78-L79). – Kelly Bundy Oct 27 '20 at 17:28
  • @HeapOverflow `Py_SET_SIZE` is quite new (3.9, https://docs.python.org/3/c-api/structures.html#c.Py_SET_SIZE) and is part of the following effort: https://bugs.python.org/issue39573. So for Python>=3.9 `Py_SET_SIZE` should be preferred; for older Python versions `Py_SIZE(self)` is the way to go. – ead Oct 28 '20 at 08:17