
I ran an experiment to measure how much memory each of Python's array-like types uses: list, tuple, set, dict, and np.array. I got the following result.

[Plot: the amount of memory each Python collection type uses; the x axis is the length of the collection, the y axis is the memory size.]

I found that the memory used by a Python set increases in steps (the same is true of dict), while the others increase linearly, as I expected. I wonder what makes them different.

I used the following get_size() function (reference):

import sys

def get_size(obj, seen=None):
    """Recursively estimate the total memory footprint of obj, in bytes."""
    size = sys.getsizeof(obj)
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        # Already counted; don't count shared objects twice.
        return 0
    seen.add(obj_id)
    if isinstance(obj, dict):
        size += sum(get_size(v, seen) for v in obj.values())
        size += sum(get_size(k, seen) for k in obj.keys())
    elif hasattr(obj, '__dict__'):
        size += get_size(obj.__dict__, seen)
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        size += sum(get_size(i, seen) for i in obj)
    return size

I measured the memory for lengths from 0 to 10,000 in intervals of 100.

My code: https://repl.it/repls/WanEsteemedLines
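
Roughly, the measurement loop looks like this (a simplified sketch; the full script is at the link above, and details such as how the dict is filled may differ from it):

import numpy as np

# get_size() is the function defined above.
lengths = range(0, 10001, 100)
results = {name: [] for name in ('list', 'tuple', 'set', 'dict', 'np.array')}

for n in lengths:
    values = list(range(n))
    results['list'].append(get_size(list(values)))
    results['tuple'].append(get_size(tuple(values)))
    results['set'].append(get_size(set(values)))
    results['dict'].append(get_size({v: v for v in values}))
    results['np.array'].append(get_size(np.arange(n)))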

Ellisein

  • Pre-allocation, probably, like most Python collections. – Jean-François Fabre Oct 16 '18 at 08:30
  • For future reference, you may want to show the benchmark *code* that you used. – Bram Vanroy Oct 16 '18 at 08:44
  • Yes, show the benchmarking code please; this will allow others to try to reproduce what you've shown and ensure that this behaviour isn't an artefact of your code setup. – Chris_Rands Oct 16 '18 at 08:51
  • set and dict are both hash-based data structures. Presumably for load-balancing they have to allocate extra "buckets" (which then requires extra memory) to ensure they remain efficient. I think that is what causes the distinctive "steps" compared to the other data structures being compared. – lightalchemist Oct 16 '18 at 08:53
  • that's also the case for `list` (we should see some small steps at least) – Jean-François Fabre Oct 16 '18 at 08:54
  • Thanks for the comment! I added the function I used to get the memory size, and I will upload the whole code soon. – Ellisein Oct 16 '18 at 08:55
  • please make that a [mcve] – Jean-François Fabre Oct 16 '18 at 08:55
  • This still doesn't show how you ran the benchmark. – Mad Physicist Oct 16 '18 at 08:57
  • Sorry I'm late. I uploaded my code to https://repl.it/repls/WanEsteemedLines. Thanks :) – Ellisein Oct 16 '18 at 09:10
  • @Jean-FrançoisFabre Incrementally appending to a list uses a different amount of memory from building a list straight from a range, which is what the OP is currently doing. If you plot the memory usage from incrementally appending to the list, you will get the small steps. Just tried it (see the sketch after these comments). – lightalchemist Oct 16 '18 at 09:29
  • @lightalchemist I did the same measurement with incremental appending instead of a fixed size, as you said, and got the following result: https://i.stack.imgur.com/8vSuJ.png Thanks :) – Ellisein Oct 17 '18 at 00:21
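
A minimal sketch of the incremental-append measurement described in the comments above (only the list object itself is measured with sys.getsizeof, not its elements; exact sizes depend on the CPython version):

import sys

lst = []
last = sys.getsizeof(lst)
for i in range(50):
    lst.append(i)
    size = sys.getsizeof(lst)
    if size != last:
        # Each jump is CPython's list over-allocation: the internal
        # pointer array grows in chunks, not one slot per append.
        print(f"len={len(lst)}: {last} -> {size} bytes")
        last = size

# Building the same list in one go typically allocates the buffer up
# front from the known length, so these intermediate steps don't appear.
print(sys.getsizeof(list(range(50))))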

1 Answer


CPython sets and dicts always use a power-of-two-sized internal hash table; the implementation cannot work with any other table size. list, tuple, and numpy.ndarray have much more flexibility in the size of their underlying memory buffer, but set and dict are hardcoded to powers of two. See Objects/dictobject.c and Objects/setobject.c in the CPython source.

The jumps in your graph are when the table size jumps to a new power of two.
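
You can watch the resizes happen with sys.getsizeof; the exact thresholds and byte counts depend on your CPython version, but the step pattern is the same:

import sys

s = set()
last = sys.getsizeof(s)
for i in range(10000):
    s.add(i)
    size = sys.getsizeof(s)
    if size != last:
        # Each jump is the internal hash table being reallocated
        # to the next (power-of-two) size.
        print(f"len={len(s)}: {last} -> {size} bytes")
        last = size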

Incidentally, your get_size doesn't work very well. For example, it has two bugs affecting the numpy.ndarray case that almost (but not quite) cancel out. It adds the sizes of a NumPy array's elements to the size of the whole array, even though getsizeof already accounts for the array's data buffer. It also tracks object identity with id, but the scalar objects produced by iterating over a NumPy array are created on the fly and die immediately, so their id values aren't unique. In practice, this probably overcounts the size by roughly one or two times the size of the objects representing the array elements.
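
Both effects are easy to check; the exact numbers depend on your NumPy and CPython builds:

import sys
import numpy as np

a = np.arange(1000)

# getsizeof on an array that owns its data already includes the data
# buffer, so adding per-element sizes on top of it double-counts.
print(a.nbytes)          # size of the element data alone
print(sys.getsizeof(a))  # nbytes plus a small object header

# Iterating yields temporary NumPy scalars that die immediately, so
# their id values get reused and an id-based `seen` set does not
# reliably deduplicate them.
print(len({id(x) for x in a}))  # typically far fewer than 1000 distinct ids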

user2357112