1

I am writing a program where I need to know the efficiency (memory wise) of different data containers in Python / Cython. One of said containers is the standard Python list.

The Python list is tripping me up because I do not know how it works on the binary level. Unlike Python, C's arrays are easy to understand, because all of the elements are the same type, and the space is declared ahead of time. This means when the programmer wants to go in and index the array, the program knows mathematically what memory address to go to. But the problem is, a Python list can store many different data types, and even nested lists inside of a list. The size of these data structures changes all the time, and the list still holds them, accounting for the changes. Does extra separator memory exist to make the list as dynamic as it is?

If you could, I would appreciate an actual binary layout of an example list in RAM, annotated with what each byte represents. This will help me to fully understand the inner workings of the list, as I am working on the binary level.

Nick Pandolfi
  • 993
  • 1
  • 8
  • 22
  • 1
    Here is a nice talk by Brandon Rhodes (PyCon 2014) about how the builtin data structures work internally: http://pyvideo.org/video/2571/all-your-ducks-in-a-row-data-structures-in-the-s – Ismail Badawi Jun 11 '14 at 20:06
  • My question was 'on the binary level', not a duplicate... – Nick Pandolfi Jun 11 '14 at 20:10

1 Answers1

4

The list object is defined in Include/listobject.h. The structure is really simple:

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

and PyObject_VAR_HEAD is defined as

typedef struct _object {
    _PyObject_HEAD_EXTRA
    Py_ssize_t ob_refcnt;
    struct _typeobject *ob_type;
} PyObject;

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

Basically, then, a list object looks like this:

[ssize_t ob_refcnt]
[type *ob_type]
[ssize_t ob_size]
[object **ob_item] -> [object *][object *][object *]...
[ssize_t allocated]

Note that len retrieves the value of ob_size.

ob_item points to an array of PyObject * pointers. Each element in a list is a Python object, and Python objects are always passed by reference (at the C-API level, as pointers to the actual PyObjects). Therefore, lists only store pointers to objects, and not the objects themselves.

When a list fills up, it will be reallocated. allocated tracks how many elements the list can hold at maximum (before reallocation). The reallocation algorithm is in Objects/listobject.c:

/* Ensure ob_item has room for at least newsize elements, and set
 * ob_size to newsize.  If newsize > ob_size on entry, the content
 * of the new slots at exit is undefined heap trash; it's the caller's
 * responsibility to overwrite them with sane values.
 * The number of allocated elements may grow, shrink, or stay the same.
 * Failure is impossible if newsize <= self.allocated on entry, although
 * that partly relies on an assumption that the system realloc() never
 * fails when passed a number of bytes <= the number of bytes last
 * allocated (the C standard doesn't guarantee this, but it's hard to
 * imagine a realloc implementation where it wouldn't be true).
 * Note that self->ob_item may change, and even if newsize is less
 * than ob_size on entry.
 */
static int
list_resize(PyListObject *self, Py_ssize_t newsize)
{
    PyObject **items;
    size_t new_allocated;
    Py_ssize_t allocated = self->allocated;

    /* Bypass realloc() when a previous overallocation is large enough
       to accommodate the newsize.  If the newsize falls lower than half
       the allocated size, then proceed with the realloc() to shrink the list.
    */
    if (allocated >= newsize && newsize >= (allocated >> 1)) {
        assert(self->ob_item != NULL || newsize == 0);
        Py_SIZE(self) = newsize;
        return 0;
    }

    /* This over-allocates proportional to the list size, making room
     * for additional growth.  The over-allocation is mild, but is
     * enough to give linear-time amortized behavior over a long
     * sequence of appends() in the presence of a poorly-performing
     * system realloc().
     * The growth pattern is:  0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
     */
    new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);

    /* check for integer overflow */
    if (new_allocated > PY_SIZE_MAX - newsize) {
        PyErr_NoMemory();
        return -1;
    } else {
        new_allocated += newsize;
    }

    if (newsize == 0)
        new_allocated = 0;
    items = self->ob_item;
    if (new_allocated <= (PY_SIZE_MAX / sizeof(PyObject *)))
        PyMem_RESIZE(items, PyObject *, new_allocated);
    else
        items = NULL;
    if (items == NULL) {
        PyErr_NoMemory();
        return -1;
    }
    self->ob_item = items;
    Py_SIZE(self) = newsize;
    self->allocated = new_allocated;
    return 0;
}

As you can see from the comments, lists grow rather slowly, in the following sequence:

0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
nneonneo
  • 171,345
  • 36
  • 312
  • 383