
Edit: the answer from "Memory usage of a list of millions of strings in Python" can be adapted to sets too.


By analyzing the RAM usage on my machine (with the process manager), I noticed that a set of millions of strings like 'abcd' takes much less memory than a set of millions of byte strings like b'abcd' (Edit: I was wrong; it was due to an error elsewhere). I would like to test this:

import random, string, sys
# Random lowercase ASCII string of the given length.
randomstring = lambda length: ''.join(random.choice(string.ascii_lowercase) for _ in range(length))
# Sets of 100,000 strings / byte strings, of length 10 and 50.
s1 = {randomstring(10) for _ in range(100_000)}
s2 = {randomstring(50) for _ in range(100_000)}
s3 = {randomstring(10).encode() for _ in range(100_000)}
s4 = {randomstring(50).encode() for _ in range(100_000)}
print(sys.getsizeof(s1), sys.getsizeof(s2), sys.getsizeof(s3), sys.getsizeof(s4))

but it always prints the same size (4194528), whereas I expected the sizes to vary by roughly a factor of 5 between the 10-character and 50-character cases, and probably to differ between the string and bytes cases as well.

How can I measure the memory taken by these sets, including all of their elements?

Note: I know that finding the whole memory taken by a structure is not easy in Python (see also "In-memory size of a Python structure"), because we need to take into account all the linked elements.

TL;DR: Is there a tool in Python to automatically measure the memory size of a set plus the memory taken by its internal references (pointers), the hash-table buckets, and the elements (strings here) stored in the set? In short: every byte that is necessary for this set of strings. Does such a memory-measurement tool exist?
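
To be concrete, the naive estimate I can compute by hand looks like this (only a sketch: it ignores shared/interned objects and anything sys.getsizeof cannot see):

# Naive estimate: the set object (its hash table) plus every element it references.
def naive_footprint(s):
    return sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)

print(naive_footprint(s1), naive_footprint(s2), naive_footprint(s3), naive_footprint(s4))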

Basj
  • Strings can contain multibyte characters. But if all the characters are ASCII, they just take 1 byte, just like the elements of byte strings. I can't think of any reason why the memory use should be different in this case. – Barmar Feb 21 '22 at 20:37
  • There's probably very little difference between a character string and byte string. They both contain a reference to the class object (just like most Python objects), the length, and the raw data. The Python interpreter has built-in code for processing strings, but that doesn't affect the data representation. – Barmar Feb 23 '22 at 14:31
  • @Barmar I now understand your first comment was about strings vs. byte strings (I thought it was about the size of a byte string vs. its number of characters, but I misread). Yeah, you're right, in the end (after new tests) there isn't much difference between strings and bytes, just a little bit less for bytes. – Basj Feb 23 '22 at 14:41

1 Answer


sys.getsizeof does not measure the size of the full target data structure. It only measures the memory taken by the set object itself, which contains references to string/bytes objects. The referenced objects are not included in the returned memory consumption (i.e. it does not walk recursively through each object of the target data structure). A reference typically takes 8 bytes on a 64-bit platform, and a CPython set is not as compact as a list: it is implemented as a hash table with many buckets, some of which are unused. In fact, this is required for this data structure to be fast (in general, the occupancy should be 50%-90%). Moreover, each bucket contains a hash, which usually takes 8 bytes.
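
As a rough sanity check (just a sketch, using the 4194528 figure from the question and assuming 8 bytes per reference plus 8 bytes per stored hash, as described above):

table_bytes = 4_194_528             # sys.getsizeof(s1) reported in the question
bucket_bytes = 8 + 8                # one reference + one stored hash per bucket
print(table_bytes // bucket_bytes)  # ~262_000 buckets for only 100_000 elements
print(table_bytes / 100_000)        # ~42 bytes of hash table per element, before counting the strings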

The strings themselves take much more space than a bucket (at least on my machine):

sys.getsizeof(randomstring(50))           # 99
sys.getsizeof(randomstring(50).encode())  # 83

On my machine, it turns out that CPython strings are 16 bytes bigger than the equivalent bytes objects.
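
That 16-byte gap is just the difference in fixed per-object overhead, which you can also see on empty objects (values from my 64-bit CPython build; they may differ slightly on other versions):

sys.getsizeof('')    # 49 on my machine: fixed str header
sys.getsizeof(b'')   # 33 on my machine: fixed bytes header, 16 bytes smaller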

Jérôme Richard
  • Thanks. Is there a tool in Python to automatically measure the size of a set + the references + the hashtable buckets + the elements (strings here) that are hosted in the set? – Basj Feb 22 '22 at 08:09
  • 1
    You can check the memory taken by the Python process (see [this](https://stackoverflow.com/questions/9850995/tracking-maximum-memory-usage-by-a-python-function) post for example). This is probably the most accurate solution to find an upper bound of the memory consumption. There are parts of the allocated memory that CPython cannot track. For example, Numpy arrays are allocated using the C allocator so CPython is not aware of that. This applies to many C-based libraries. Note that this is an upper bound since some allocators are conservative: they do not release memory to the OS directly. – Jérôme Richard Feb 23 '22 at 00:23
  • 1
    Note also that reference can form cycles so this is also why tracking the total amount of memory is a bit tricky in practice. Not to mention some references can appear several time in the same list. I found [this](https://stackoverflow.com/questions/33978/find-out-how-much-memory-is-being-used-by-an-object-in-python) related post that provides more information. – Jérôme Richard Feb 23 '22 at 00:29
  • You're right @JérômeRichard, so I'll probably continue using [this simple upper bound](https://stackoverflow.com/a/21632554/1422096) as an indicator of the size of the multiple-gigabyte set. During my tests I only have one big set and nothing else in my code, so this should be a good estimate, I think. – Basj Feb 23 '22 at 06:27
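
A minimal sketch of the process-level measurement discussed in the comments above, assuming psutil is installed (the RSS delta is only an approximate upper bound and depends on the allocator):

import random, string
import psutil

randomstring = lambda length: ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

proc = psutil.Process()                  # current Python process
rss_before = proc.memory_info().rss      # resident set size before building the set
big_set = {randomstring(50) for _ in range(1_000_000)}
rss_after = proc.memory_info().rss
print((rss_after - rss_before) / 1e6, "MB used by the set (approximate upper bound)")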