1

Tying to reduce memory space taken by a large dictionary I changed dictionary's structure from initially {string: boolean} to {int: boolean}. To my surprise, the memory used by the dictionary remained the same:

print(sys.getsizeof(myDictionary))
>140584

Could you explain to my why using int (of size 24 bytes) instead of strings (of at least 60 bytes, probably more due to my data type) doesn't help reducing the full dictionary size? Is it because both already link to an object?

Here is how the dictionary is computed:

  • for {string: boolean} dictionary

myDictionary ={feat:(feat in item_feature_list) for feat in model_features_list}

  • for {int: boolean} dictionary

myDictionary = {int(i):(feat in item_feature_list) for feat, i in enumerate (model_features_list)}

thanks.

ylnor
  • 4,531
  • 2
  • 22
  • 39
  • 2
    `sys.getsizeof` doesn't tell you anything about the memory footprint of the keys and values, only of the dictionary itself. It varies with the **number** of entries, not the nature of the entries. – Martijn Pieters Mar 01 '17 at 14:04
  • Thx, How can I monitor the memory used of the full dictionary then? – ylnor Mar 01 '17 at 14:06
  • @MartijnPieters the dictionary keeps the hash but has to keep the keys themselves to compare keys when hash matches (collisions) right? in that case, an integer may be smaller than a string. – Jean-François Fabre Mar 01 '17 at 14:07
  • @Jean-FrançoisFabre: hashes have a fixed size, the hash for an `int` is not smaller than the hash for a `str` object; they are both just C integers in the data structure. – Martijn Pieters Mar 01 '17 at 14:07
  • @MartijnPieters yes, but 2 keys can have the same hash so the dictionary _has_ to store the list of keys, so if the hash matches, it checks if passed key matches the stored key (else it wouldn't handle collisions) – Jean-François Fabre Mar 01 '17 at 14:09
  • `enumerate()` produces the index *first*, so I think your 2nd comprehension should be ...`i, feat in enumerate`... – Chris_Rands Mar 01 '17 at 14:09
  • @Jean-FrançoisFabre: yes, keys are stored as *C pointers*. Pointers to Python objects are all the same (platform dependent) size, regardless of the Python object type. – Martijn Pieters Mar 01 '17 at 14:12
  • 1
    @Jean-FrançoisFabre This is not how is works. Python uses a "back up" algorithm to compute the alternative slot. So if a collision occurs while inserting, it uses the back-up algorithm until empty slot is found. The same procedure is used when retrieving the key. – hashcode55 Mar 01 '17 at 14:12
  • @hashcode55 I guess that's why you chose your nickname like that. So the chances of collision with _both_ algorithms are considered 0 or is this proven? – Jean-François Fabre Mar 01 '17 at 14:13
  • @hashcode55: Jean-François is not saying anything about how collisions are handled, only that the keys are stored to detect collisions. Of course the dict has to keep track of the keys (how else could it enumerate them later on) but that doesn't mean that the type of the key matters to the dict memory footprint. – Martijn Pieters Mar 01 '17 at 14:13
  • This discussion is going way, *way* off topic. Perhaps you can take it to a chatroom? And hash collisions still happen, the chance is never 0, that's not what distinguishes different hash collision handling strategies. – Martijn Pieters Mar 01 '17 at 14:15
  • @ylnor: see the duplicate, you need to recursively get the size for all objects involved. – Martijn Pieters Mar 01 '17 at 14:15
  • @Chris_Rands: the variable names are merely misnamed, probably *both* as the targets and in the key and value expressions. – Martijn Pieters Mar 01 '17 at 14:16

1 Answers1

2

The size of the dictionary is independent of the type of key used. Whether you use string or int, python will always allocate the same space for the key. It'll apply the hash function to the key and pick up the last 3 bits of computed hash value as key (this value grows as the size of dictionary grows to avoid collisions).

Python uses the hash function to compute the key's hash value, which will always be an integer indexing the key in the memory and hence taking the same space (this is why I said independent of the key type). Its not like int will be 4 bytes and and a string will take len(string) bytes space.

hashcode55
  • 5,622
  • 4
  • 27
  • 40
  • Thx, so no ways to reduce the size of my dictionary? – ylnor Mar 01 '17 at 14:07
  • @ylnor at-least you cannot reduce the size by altering the type of key. – hashcode55 Mar 01 '17 at 14:09
  • Note that this has nothing to do with the number of slots in the dict; the OP expected `sys.getsizeof()` to include the memory use for the keys and values, so swapping out one key type for another would result in less memory. Sure, the overall Python heap size can be reduced, you just won't see that reflected in the `sys.getsizeof()` return value. – Martijn Pieters Mar 01 '17 at 14:18
  • @MartijnPieters I really don't think, it'll change the size. The dictionary starts with 3 bit hash (this is independent of the **type** of key used, the only thing left is that hash value). – hashcode55 Mar 01 '17 at 14:22
  • I'm not sure what you mean there. Are you saying that `{1: 'foo'}` takes as much memory as `{'one': 'foo'}`? Both dicts have the same number of slots, and `sys.getsizeof()` returns the *exact same memory size*. But that's because that doesn't include the sizes of `1` and `'one'`. See http://code.activestate.com/recipes/577504/ for a thorough function that gives you the total amount of memory referenced by a container object. – Martijn Pieters Mar 01 '17 at 14:25
  • @MartijnPieters I'm just saying when python uses the `hash` function to compute the keys hash value, both will be integers indexing the key in the memory and hence taking the same space (this is why I said independent of the `key` type). Its not like `1` will be 4 byte and `one` will be 3 byte. – hashcode55 Mar 01 '17 at 14:29
  • @MartijnPieters Thanks for the link btw. – hashcode55 Mar 01 '17 at 14:32
  • Right, yes, the hash value is a fixed size. – Martijn Pieters Mar 01 '17 at 14:33