What you observe, is that pymalloc
(Python memory manager) is faster than the memory manager provided by your C-runtime.
It is easy to see in the profiler, that the main difference between both versions is that list_resize
and _PyObjectRealloc
need more time for the a=[]
-case. But why?
When a new list is created from an iterable, the list tries to get a hint how many elements are in the iterator:
n = PyObject_LengthHint(iterable, 8);
However, this doesn't work for generators and thus the hint is the default value 8
.
After the iterator is exhausted, the list tries to shrink, because there are only 0 or 1 element (and not the original capacity allocated due to a too large size-hint). For 1 element this would lead to (due to over-allocation) capacity of 4 elements. However, there is a special handling for the case of 0 elements: it will not be over-allocated:
// ...
if (newsize == 0)
new_allocated = 0;
num_allocated_bytes = new_allocated * sizeof(PyObject *);
items = (PyObject **)PyMem_Realloc(self->ob_item, num_allocated_bytes);
// ...
So in the "empty" case, PyMem_Realloc
will be asked for 0 bytes. This call will be passed via _PyObject_Malloc
down to pymalloc_alloc
, which in case of 0 bytes returns NULL
:
if (UNLIKELY(nbytes == 0)) {
return NULL;
}
However, _PyObject_Malloc
falls back to the "raw" malloc, if pymalloc
returns NULL
:
static void *
_PyObject_Malloc(void *ctx, size_t nbytes)
{
void* ptr = pymalloc_alloc(ctx, nbytes);
if (LIKELY(ptr != NULL)) {
return ptr;
}
ptr = PyMem_RawMalloc(nbytes);
if (ptr != NULL) {
raw_allocated_blocks++;
}
return ptr;
}
as can be easily seen in the definition of _PyMem_RawMalloc
:
static void *
_PyMem_RawMalloc(void *ctx, size_t size)
{
/* PyMem_RawMalloc(0) means malloc(1). Some systems would return NULL
for malloc(0), which would be treated as an error. Some platforms would
return a pointer with no memory behind it, which would break pymalloc.
To solve these problems, allocate an extra byte. */
if (size == 0)
size = 1;
return malloc(size);
}
Thus, the case a=[0]
will use pymalloc
, while a=[]
will use the memory manager of the underlying c-runtime, which explains the observed difference.
Now, this all can be seen as missed optimization, because for newsize=0
, we could just set the ob_item
to NULL
, adjust other members and return.
Let's try it out:
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
// ...
if (newsize == 0) {
PyMem_Del(self->ob_item);
self->ob_item = NULL;
Py_SIZE(self) = 0;
self->allocated = 0;
return 0;
}
// ...
}
with this fix, the empty-case becomes slightly faster (about 10%) than the a=[0]
case, as expected.
My claim, that pymalloc
is faster for smaller sizes than the C-runtime memory manager, can be easily tested with bytes
: if more than 512 bytes need to be allocated, pymalloc
will fallback to simple malloc
:
print(bytes(479).__sizeof__()) # 512
%timeit bytes(479) # 189 ns ± 20.4 ns
print(bytes(480).__sizeof__()) # 513
%timeit bytes(480) # 296 ns ± 24.8 ns
the actual difference is more than the shown 50% (this jump cannot be explained by change of the size by one byte alone), as at least some part of the time is used for initialization of byte object and so on.
Here is a more direct comparison with help of cython:
%%cython
from libc.stdlib cimport malloc, free
from cpython cimport PyMem_Malloc, PyMem_Del
def with_pymalloc(int size):
cdef int i
for i in range(1000):
PyMem_Del(PyMem_Malloc(size))
def with_cmalloc(int size):
cdef int i
for i in range(1000):
free(malloc(size))
and now
%timeit with_pymalloc(1) # 15.8 µs ± 566 ns
%timeit with_cmalloc(1) # 51.9 µs ± 2.17 µs
i.e. pymalloc
is about 3 times faster (or about 35ns per allocation). Note: some compilers would optimize free(malloc(size))
out, but MSVC doesn't.
As another example: some time ago I have replaced the default allocator through pymalloc for a c++'s std::map
which led to a speed up of factor 4.
For profiling the following script was used:
a=[0] # or a=[]
for _ in range(10000000):
list(x for x in a)
together with VisualStudio's built-in performance profiler in Release-mode.
a=[0]
-version needed 6.6 seconds (in profiler) while a=[]
version needed 6.9 seconds (i.e. ca. 5% slower). After the "fix", a=[]
needed only 5.8 seconds.
The share of time spent in list_resize
and _PyObject_Realloc
:
a=[0] a=[] a=[], fixed
list_resize 3.5% 10.2% 3%
_PyObject_Realloc 3.2% 9.3% 1%
Obviously, there is variance from run to run, but the differences in running times are significant and can explain the lion's share of observed time difference.
Note: the difference of 0.3
second for 10^7
allocations is about 30ns per allocation - a number similar to the one we get for the difference between pymalloc's and c-runtime's allocations.
When verifying the above with debugger, one must be aware, that in the debug-mode Python uses a debug version of pymalloc, which appends additional data to the required memory, thus pymalloc will never be asked to allocate 0 bytes in debug-version, but 0 bytes + debug-overhead
and there will be no fallback to malloc
. Thus, one should either debug in release mode of switch to realease-pymalloc in debug-build (there is probably an option for it - I just don't know it, the relevant part in code is here and here).