Edit: the answer from Memory usage of a list of millions of strings in Python can be adapted to sets too.
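For reference, a minimal sketch of that adaptation (the helper name deep_set_size is mine; it just adds the per-element sizes to the container size reported by sys.getsizeof, and ignores interning or objects shared between containers):

import sys

def deep_set_size(s):
    # size of the set object itself (hash table) + size of each element object
    # (does not follow further references, fine for str/bytes elements)
    return sys.getsizeof(s) + sum(sys.getsizeof(x) for x in s)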
By looking at the RAM usage on my machine (in the process manager), I noticed that a set of millions of strings like 'abcd'
seemed to take much less memory than a set of millions of bytes objects like b'abcd'
(Edit: I was wrong, this was due to an error elsewhere). I would like to test this:
import random, string, sys

# random lowercase ASCII string of the given length
randomstring = lambda length: ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

s1 = {randomstring(10) for _ in range(100_000)}           # 100k short strings
s2 = {randomstring(50) for _ in range(100_000)}           # 100k long strings
s3 = {randomstring(10).encode() for _ in range(100_000)}  # 100k short bytes objects
s4 = {randomstring(50).encode() for _ in range(100_000)}  # 100k long bytes objects

print(sys.getsizeof(s1), sys.getsizeof(s2), sys.getsizeof(s3), sys.getsizeof(s4))
but this prints the same size, 4194528, for all four sets,
whereas I would expect the total memory to differ by roughly a factor of 5 between the 10-character and 50-character sets, and probably also between the str and bytes cases.
How can I measure the memory taken by each of these sets, including all of their elements?
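One workaround I can think of is to measure the allocation delta while building the set with the standard-library tracemalloc (only a rough sketch; it measures what is allocated during construction, not the size of an already existing set):

import tracemalloc, random, string

randomstring = lambda length: ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

tracemalloc.start()
s = {randomstring(50) for _ in range(100_000)}
current, peak = tracemalloc.get_traced_memory()  # bytes currently allocated / peak since start()
print(current)  # dominated by the set and its string elements
tracemalloc.stop()

But this is not a per-structure measurement, so it does not really answer the question below.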
Note: I know that finding the total memory taken by a structure is not easy in Python (see also In-memory size of a Python structure), because we need to take into account all the linked elements.
TL;DR: Is there a tool in Python that automatically measures the memory size of a set plus the memory taken by its internal references (pointers), the hash-table buckets, and the elements (strings here) stored in the set? In short: every byte that is needed for this set of strings. Does such a memory-measurement tool exist?