
I've read that Python lists are supposed to take 64 bytes, and that every entry in a list is supposed to 'cost' 8 additional bytes. I decided to test this out. However, while testing this, I discovered that the result depends on how you add items to your list. Why are the sys.getsizeof values for obj1 and obj2 not consistent in the code below?

import sys

test1 = 'This is a string.'

obj1 = []

obj2 = ['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '.']

obj3 = list(test1)

for i in range(len(test1)):
    obj1.append(test1[i])

print(sys.getsizeof(obj1))
print(obj1)
print(sys.getsizeof(obj2))
print(obj2)
print(sys.getsizeof(obj3))
print(obj3)

264
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '.']
200
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '.']
264
['T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'a', ' ', 's', 't', 'r', 'i', 'n', 'g', '.']

obj2 reports the size that I expected (64 + 8*17 = 200). What is the overhead after using the append function, and can that overhead somehow be removed after constructing the list?

I've read through a related topic here, but I don't think it has the same answer, as that one seemed to be related to Pandas.

RoyM
  • Because when you use a list literal, Python knows the *exact* size. If you use `.append`, it will over-allocate when it resizes. More importantly, **why do you care**? This overhead is an *infinitesimal fraction*, and it's what gets you amortized constant-time appends, which is marvelous. – juanpa.arrivillaga Apr 22 '19 at 22:31
  • Almost certainly a duplicate of this: https://stackoverflow.com/questions/51526242/why-do-two-identical-lists-have-a-different-memory-footprint/51526419 although it isn't exactly the same. In that one, you're seeing this behavior when you use `some_list * some_int`, another example of when the interpreter can know *exactly* the size of the resulting list at runtime. Same situation with your list literal, although digging up that source code will be annoying. But if you `dis.dis` a list-literal expression, you'll see the `BUILD_LIST` opcode. – juanpa.arrivillaga Apr 22 '19 at 22:37
  • I **care** because I intended to use the value to calculate compression rates. I needed to know where the difference came from so that I could correctly account for it. And I don't think why I care is the more important thing; it's an interesting question in and of itself. But thank you for the link; strange it didn't pop up in any of my searches. – RoyM Apr 22 '19 at 22:53
  • You should be **very wary** of using `sys.getsizeof` for that sort of thing. Note, it *only* gives you the size of the buffer + Python object overhead; the buffer will be the size of that PyObject* array times whatever a PyObject pointer is on your system (generally a machine word). Furthermore, reading through both of those questions, you will see that the way the underlying buffer grows can be subtly different based on various things. You should consider changing your approach. What are you compressing? – juanpa.arrivillaga Apr 22 '19 at 22:55
  • I see. And thank you for asking. I'm performing an LZW compression of a string. – RoyM Apr 22 '19 at 22:58
  • Then you should use a `str`, or better yet, a `bytes` object. `bytes` especially are made for this sort of thing, and will scale much more predictably. I'm actually not sure about a `bytearray`, which may really be the best. `bytearray` may overallocate, but maybe not? In either case, if you create a `bytes` from a `bytearray` it should give you the exact size + some constant overhead, much more reliable than a `list` object. Sounds like fun :) – juanpa.arrivillaga Apr 22 '19 at 23:00
  • Alright, I think I got it. Thanks for the clarification! – RoyM Apr 22 '19 at 23:03
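To make the suggestion in the last few comments concrete, here is a small sketch comparing sys.getsizeof for the same payload held as a list, a bytes object, and a bytearray. The variable names and the 'ascii' encoding are just assumptions for this illustration, and the exact byte counts depend on the CPython version and platform, so treat the printed numbers as indicative only:

import sys

data = 'This is a string.'

as_list = list(data)                # list of 1-character strings
as_bytes = data.encode('ascii')     # immutable bytes object
as_bytearray = bytearray(as_bytes)  # mutable; may or may not keep spare capacity

# Compare the trends rather than the absolute numbers; the constants are
# CPython implementation details and depend on pointer size and version.
print(sys.getsizeof(as_list))       # container overhead + one pointer per element
print(sys.getsizeof(as_bytes))      # small constant overhead + one byte per character
print(sys.getsizeof(as_bytearray))  # list-like container that can over-allocate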

1 Answer


The overhead is due to Python allocating a few extra "slots" in the list so that you can, for example, append to the list without needing to allocate new memory and copy the old list there.

You can see this by appending a few values to your lists:

>>> obj1.append('!')
>>> print(sys.getsizeof(obj1))
264

>>> obj2.append('!')
>>> print(sys.getsizeof(obj2))
272

As you can see, it is possible to add to obj1 without it growing, whereas obj2 needs to allocate a new "chunk".

This is an implementation detail, but the behavior is typically good when you dynamically increase the size of your lists, as it will reduce the memory fragmentation (and increase the speed) of your program.
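If it helps to visualize this, here is a rough sketch that appends items one at a time and prints sys.getsizeof only when the size actually changes. The exact step sizes are a CPython implementation detail and differ between versions, so the specific numbers you see are not guaranteed:

import sys

growing = []
last_size = sys.getsizeof(growing)
print(0, last_size)

# The reported size jumps in steps rather than per element, because CPython
# over-allocates each time the underlying buffer has to grow.
for n in range(1, 33):
    growing.append(None)
    size = sys.getsizeof(growing)
    if size != last_size:
        print(n, size)
        last_size = size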

JohanL
  • You seem to be thinking of Python lists as unrolled linked lists or something. They're not; they're [dynamic arrays](https://en.wikipedia.org/wiki/Dynamic_array). They don't allocate chunks, and the extra room isn't there to avoid memory fragmentation. Python lists use a [single contiguous buffer](https://docs.python.org/3/faq/design.html#how-are-lists-implemented-in-cpython) of element references, and the extra space is there to avoid having to reallocate the buffer on every append. This allows amortized constant time appends instead of linear time appends. – user2357112 Apr 22 '19 at 22:37
  • @user2357112 Whether you allocate a new list and copy the content or actually use a linked list, the behavior is the same in terms of memory fragmentation and the ability to append to the list. But, ok, I could be a bit clearer in how to explain it. – JohanL Apr 22 '19 at 22:40