6

I was reading the book 'Fluent Python' when I encountered a sentence where the author states that

If you need to store 10 million floating-point values an array is much more efficient, because an array does not actually hold full fledged objects, but only the packed bytes representing their machine values - just like array in C language.

I am not able to understand what the author is trying to convey. What is he saying about 'packed bytes' ? what does 'packed bytes storing' mean ? . How is python lists storing it ? why is it not storing it that way if that is what is making it efficient ?

Harish Kayarohanam
  • 3,886
  • 4
  • 31
  • 55
  • 1
    Possible duplicate of [why are numpy arrays so fast](http://stackoverflow.com/questions/8385602/why-are-numpy-arrays-so-fast) – Lucas Dec 23 '16 at 20:33
  • An array is (usually) a contiguous part of the the memory that contains homogeneous types. This allows for multiple optimizations vs linked lists (python's `list`). – Reut Sharabani Dec 23 '16 at 20:35
  • Python lists are not guaranteed to be contiguous. C arrays, from which Array.array is constructed, are contiguous in memory. They are stored as scalars (typically on the stack) whereas objects use memory allocated on the heap. – Charles D Pantoga Dec 23 '16 at 20:37
  • @CharlesAddis. I am not sure about array memory being stored on the stack. – Mad Physicist Dec 23 '16 at 20:38
  • 2
    @ReutSharabani. Python's `list` is not a linked list at all. It is a contiguous piece of memory that holds the pointers to the objects themselves. That extra level of indirection is what makes it inifficient. – Mad Physicist Dec 23 '16 at 20:39
  • @MadPhysicist thanks, I assumed they were for some reason. – Reut Sharabani Dec 23 '16 at 20:42
  • @ReutSharabani. I don't think there's anything preventing that in the spec. I just happen to have seen the source code of the reference implementation. – Mad Physicist Dec 23 '16 at 20:51
  • @MadPhysicist everything in C is stored on the stack unless specifically stored on the heap using the `alloc` family of functions. – Charles D Pantoga Dec 25 '16 at 21:38
  • @ CharlesAddis agreed. I just meant that I'm pretty sure the data buffer for Python array objects is generally malloced. I'll look it up just to make sure, but I can't imagine it working otherwise since a stack buffer would only live in the scope of the constructor or initializer. – Mad Physicist Dec 27 '16 at 03:12

2 Answers2

9

Let's say you're dealing with 8-byte floating-point numbers. "Packed bytes" in this context means that there's a dedicated chunk of allocated memory in which the first 8 bytes represent the first float, and then immediately the next 8 bytes represent the next float, and so on with no wastage. It's the most space-efficient way there is of storing the data (at least, without compression). It may also be the most time-efficient for certain operations (for example, arraywise arithmetic operations).

A Python list doesn't store things that way. For one thing, one list element could be a float but the next one might be some other type of object. For another thing, you can remove, insert or replace items in a list. Some of these operations involve lengthening or shortening the list dynamically. All are very time- and memory- inefficient if items are stored as packed bytes. The Python list class is designed to be as general-purpose as possible, making compromises between the efficiency of various types of operations.

Probably the most important difference is that a Python list, in its underlying implementation, is a container full of pointers to objects, rather than a container full of raw object content. One implication of this is that multiple references to the same Python object can appear in a list. Another is that changing a particular item can be done very efficiently. For example, let's say the first item in your list, a[0], is an integer, but you want to replace it with a string that takes up more memory, e.g. a[0] = "There's a horse in aisle five." A packed array would have to (a) make extra room, shifting all of the rest of the array content in memory and (b) separately update some sort of index of item sizes and types. Like most languages, Python's packed-array implementation (array.array) would not even allow this: instead, it makes more sense for arrays to guarantee and enforce uniform element size and type. By contrast, a Python list would only have to overwrite one pointer value with another in this situation, and has no such restrictions.

In fact, it should hopefully be clear by now that these pointers themselves do not even point directly to object content. For example, they would not point directly to the 8 bytes that contain a floating-point value, but rather to PyObj structures that carry all the necessary meta-information (such as the declaration "my content must be interpreted as a floating-point number") as well as the content itself.

In the CPython implementation, the pointers themselves may still be (more or less) packed in memory. This means that inserting a new item into a list will usually still be inefficient (relative to the way it would be if the Python list implementation used, say, a link-list structure under the hood).

In general, there's no absolute "efficient" or "inefficient"—it's all a question of which resource you're being efficient with, what content types (and restrictions on content type) there are in the container, and how you are intending to transform the container or its contents.

jez
  • 14,867
  • 5
  • 37
  • 64
  • 3
    Lists are stored the same way as arrays. The main difference is that they hold pointers. Resizing a list is just about as inefficient as an array, and very often involves reallocating the whole thing (like with `append`). – Mad Physicist Dec 23 '16 at 20:40
  • @MadPhysicist That's the basic principle, though I believe lists are not always guaranteed to be contiguous in memory. But I see your point, perhaps I shouldn't be giving the impression that a `list` is designed to be optimal for any operation in particular – jez Dec 23 '16 at 20:46
  • @jez I'd emphasize that arrays hold two guarantees (usually) - type and size. These are the assumptions that allow them to be more optimal to my understanding. – Reut Sharabani Dec 23 '16 at 20:48
4

Under the hood, Python mostly works with pointers. A Python list is actually an array of pointers (at least in the reference implementation). That is part of where the inefficiency comes from. The other part is that Python normally uses duck-typing, so you can not tell if an operation is going to work on a list element until you try it.

When you access a list element to do something with it, e.g. call the __add__ method, you have to A) dereference the pointer to the actual object, B) find out if it has an __add__ attribute, check if it is callable, then actually make the call.

In an array, you know the type of data being stored up-front, so you know all the attributes it has and all the operations. On top of that, all actual the numbers are packed together into a single piece of memory. This is not the case with lists. Since numbers are just a type of object in Python, lists of numbers are really lists of generic object pointers.

To sum up, arrays store raw numbers instead of pointers to large objects wrapping the numbers, making them smaller. Arrays know the type of their contents up-front, so applying an operation to the entire array only requires one check, instead of a check on every element as with lists.

Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • I'd add that you know the size as well (And it doesn't change without re-allocation), this is what allows storing everything continuous in memory. – Reut Sharabani Dec 23 '16 at 20:54
  • @ReutSharabani. Knowing the size is irrelevant. You know the size of a list too. – Mad Physicist Dec 23 '16 at 20:55
  • Not true. Lists can change size without re-allocation, arrays usually don't have any size-modifying operation. Their length is usually a property not a function. Generally. – Reut Sharabani Dec 23 '16 at 20:57
  • Nitpick here, but it's an attribute. A property is even worse than a function. It's an entire class. – Mad Physicist Dec 23 '16 at 20:58
  • I'm not talking about python specifically, just concepts :) – Reut Sharabani Dec 23 '16 at 20:59
  • @ReutSharabani This question is specifically about arrays as represented by the built-in module `array` – pvg Dec 23 '16 at 20:59
  • That aside though, you don't apply an operation to a list when it is being reallocated, so for the purposes of actually using a list, it's size is fixed. Also, the list size is, under the hood, just an attribute. – Mad Physicist Dec 23 '16 at 20:59
  • @MadPhysicist `append` resizes a list (and re-allocates memory sometimes), not sure what you mean. Arrays do not normally have this option. If the quote said "if you want 10-20 floating points" choosing arrays would be less justified. Same is true if it read "if you want floating points and strings". – Reut Sharabani Dec 23 '16 at 21:01
  • When you say "add 4 to each element of this thing", the size of "thing" is going to be fixed for the entirety of the "adding 4" operation, be it an array or a list. Whether you have a fancy `append` method to do the reallocation under the hood, as with a list, or you have to make a new larger array and copy in the old one to reallocate is completely irrelevant to the efficiency of accessing the data. – Mad Physicist Dec 23 '16 at 21:04