As seen in Find the memory size of a set of strings vs. set of bytestrings, it's difficult to precisely measure the memory used by a set or list containing strings. But here is a good estimate/upper bound:
import os, psutil
process = psutil.Process(os.getpid())
a = process.memory_info().rss   # resident set size (bytes) before building the list
L = [b"a%09i" % i for i in range(10_000_000)]   # 10 million 10-byte byte-strings
b = process.memory_info().rss   # resident set size (bytes) after
print(L[:10]) # [b'a000000000', b'a000000001', b'a000000002', b'a000000003', b'a000000004', b'a000000005', b'a000000006', b'a000000007', b'a000000008', b'a000000009']
print(b - a)  # 568762368 bytes
i.e. about 569 MB of RSS for 100 MB of actual data (10 million byte-strings of 10 bytes each).
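For comparison, here is what CPython itself reports for the same objects, as a minimal sketch assuming the list L built above is still alive (sys.getsizeof gives per-object sizes only, without allocator rounding, and the list's size does not include its elements):

import sys
print(sys.getsizeof(L[0]))   # one 10-byte bytes object: header + payload + trailing NUL,
                             # typically 43 bytes on a 64-bit CPython build
print(sys.getsizeof(L))      # shallow size of the list: header + 8 bytes per (over-)allocated pointer slot
print(sys.getsizeof(L) + sum(sys.getsizeof(x) for x in L))
                             # naive total; it ignores allocator rounding/overhead,
                             # so it should come out smaller than the RSS delta above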
Solutions for improving this (for example, with other data structures) have been discussed in Memory-efficient data structure for a set of short bytes-strings and Set of 10-char strings in Python is 10 times bigger in RAM as expected, so my question here is not "how to improve", but:
How can we precisely explain this size in the case of a standard list of byte-strings?
How many bytes are used by each byte-string and by each (linked?) list item, so that they add up to 569 MB?
This will help in understanding the internals of lists and byte-strings in CPython (platform: 64-bit Windows).
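For reference, a possible cross-check (just a sketch: tracemalloc records the sizes requested through Python's memory allocator, so it gives a third data point besides sys.getsizeof and the process-level RSS delta):

import tracemalloc
tracemalloc.start()
L = [b"a%09i" % i for i in range(10_000_000)]
current, peak = tracemalloc.get_traced_memory()   # current / peak bytes, as requested from the allocator
tracemalloc.stop()
print(current)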