
I have read many Stack Overflow answers about why dictionary keys must be immutable.

I read answers like:

Why must dictionary keys be immutable?

Below is my understanding

  1. A hash code is generated for each dictionary key, and retrieval uses that hash when we do something like d["name"]. This makes lookup more efficient than in a list or tuple, since the dictionary knows where the value lives in memory.

  2. If mutable data types were allowed as dictionary keys and a key changed, a new hash code would be generated, and the dictionary would not be able to find the entry, since the hash code has changed and the dictionary has no memory of the old one.
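
To make points 1 and 2 concrete, here is a rough sketch of how I picture the lookup (just my own illustration, ignoring collisions and the real CPython layout):

table_size = 8
slots = [None] * table_size                        # a toy hash table

def put(key, value):
    slots[hash(key) % table_size] = (key, value)   # slot chosen by the key's hash

def get(key):
    entry = slots[hash(key) % table_size]          # recompute the hash to find the slot
    if entry is not None and entry[0] == key:
        return entry[1]
    raise KeyError(key)

put("name", "guido")
print(get("name"))                                 # guido
# If hash("name") could change between put() and get(), get() would look in the wrong slot.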

My question is: why can't the dictionary store this new hash code and use it during retrieval? Why is this not acceptable, and what is the downside?

If my question is not clear, please let me know what I am missing, and I will be more descriptive.

Thanks in advance!

young_minds1
  • The dictionary isn't informed when the object is modified, so it can't recompute the hash code. – Barmar Sep 11 '21 at 06:21
  • Does this answer your question? [Why must dictionary keys be immutable?](https://stackoverflow.com/questions/24217647/why-must-dictionary-keys-be-immutable) – Joe Sep 11 '21 at 07:01

2 Answers


Actually, you CAN make mutable keys in Python; the code below is an example. You just REALLY don't want to, because you lose access to dictionary entries, so developers generally make sure their mutable types raise an error when you try to use an instance as a key.

You ask why dictionaries can't just recompute the hash code so mutable keys never lose access to their values. Look at the example below: the mutation is badKey.data = 20 and has nothing to do with myDict, so myDict has no way of knowing it should rehash.

It's hypothetically possible to make a dictionary rehash when a key is mutated, but it requires the key to hold a reference to the dictionary so it can notify it of changes. That just creates lots of headaches:

  1. badKey has to circularly contain myDict which contains badKey, which is extra work for the garbage collector,
  2. badKey can be in multiple dictionaries, so badKey must contain all of them,
  3. why go to all this trouble when you can use either id(badKey), a unique integer ID for badKey, or some immutable snapshot of badKey as the key? For the latter, you can freeze a list's contents at one point in time into a tuple and use that as a key (see the sketch after the example code below).
class badHasher:
    'hash value depends on an attribute; this is very bad'
    def __init__(self, something):
        self.data = something
    def __hash__(self):
        return hash(self.data)
    def __repr__(self):
        return 'badHasher(' + str(self.data) + ')'

badKey = badHasher(10)
myDict = {badKey: 'ten'}

# run below code line by line in REPL

print(myDict[badKey]) # ten

badKey.data = 20

print(myDict[badKey]) # KeyError: badHasher(20)

badKey.data = 10

print(myDict[badKey]) # ten
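
For completeness, here is a sketch of the two workarounds from point 3 (the variable names are just for illustration):

data = [1, 2, 3]                       # a mutable object we want to key on

# Option A: key on the object's identity, which never changes.
# (Keep a reference to data alive; id() values can be reused after garbage collection.)
by_identity = {id(data): 'numbers'}
data.append(4)
print(by_identity[id(data)])           # 'numbers' -- still found after mutation

# Option B: freeze the contents into an immutable snapshot at one point in time.
by_contents = {tuple(data): 'numbers'}
print(by_contents[(1, 2, 3, 4)])       # 'numbers'
data.append(5)                         # later mutations don't affect the stored key
print(by_contents[(1, 2, 3, 4)])       # still 'numbers'
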
BatWannaBe

Your question is asking about the tradeoffs / downsides between

  1. The current CPython model, where a dictionary treats the hashes of its keys as an invariant, and thus you are told 'only immutable keys are allowed' (despite there being sneaky ways to break this), and

  2. A theoretical implementation where, before any dictionary operation, the dictionary would rehash all of its keys (or, alternatively, be notified of hash changes from its elements) and reorganize any changed keys

The answer is, as expected, performance, safety, and threading, in some order. You can, of course, create your own mapping that works like (2), but you inevitably encounter serious tradeoffs. You need to know a bit about CPython to fully appreciate why this choice was made, and why it's unlikely to change. I'll be focusing on the performance reasons for the rest of the post.
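
To see what (2) would cost, here is a minimal sketch of a user-level mapping that behaves that way (all names here are made up for illustration; it rebuilds its lookup table before every operation, which is exactly the O(n) overhead the real design avoids, and it ignores hash collisions entirely):

class MutableKey:
    'hash depends on mutable state, like badHasher in the other answer'
    def __init__(self, data):
        self.data = data
    def __hash__(self):
        return hash(self.data)

class RehashingDict:
    'toy mapping in the spirit of (2): recompute every key hash before each operation'
    def __init__(self):
        self._entries = []                             # [key, value] pairs; keys may mutate

    def _index(self):
        # The O(n) step the real dict avoids: rehash all keys and rebuild the table.
        return {hash(k): i for i, (k, v) in enumerate(self._entries)}

    def __setitem__(self, key, value):
        i = self._index().get(hash(key))
        if i is None:
            self._entries.append([key, value])
        else:
            self._entries[i][1] = value

    def __getitem__(self, key):
        i = self._index().get(hash(key))
        if i is None or self._entries[i][0] != key:
            raise KeyError(key)
        return self._entries[i][1]

d = RehashingDict()
k = MutableKey(10)
d[k] = 'ten'
k.data = 20                                            # mutate the key...
print(d[k])                                            # ...and 'ten' is still reachable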

Reason number one: Python has lots of dictionaries. Globals are stored in a dictionary; all instance attributes of classes are stored in the instance's __dict__. Dictionaries are fundamental to classes and to Python in general. Thus, dictionaries need to be fast, and they need to be small.
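
You can see both of these for yourself (a small illustrative snippet, nothing beyond what the language documents):

x = 42
print(type(globals()), globals()['x'])   # <class 'dict'> 42

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
print(p.__dict__)            # {'x': 1, 'y': 2} -- attributes live in a dict
p.__dict__['y'] = 99
print(p.y)                   # 99 -- attribute lookup goes through that dict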

Reason number two: Because dictionaries are so prevalent throughout Python code and internals, they are highly optimized. For a good overview, see this answer, which mentions a few optimizations that make recent Python 3 dictionaries much smaller and faster than Python 2's: compact dictionaries and key-sharing. As a great oversimplification, compact dicts look different from the usual textbook picture of a hash table. They are split into:

  1. a sparse, half-empty (really, half-null) array of integers used for indexing into
  2. a dense, ordered list of (hash, key, value) structures with no empty spaces.

This makes dictionaries small and fast to iterate over and to add or look up elements, but worse at deleting items or rehashing keys, which are comparatively rare operations.
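
A toy model of that split layout, purely to illustrate the idea (collisions, resizing, and deletion are all omitted; CPython's C implementation handles them):

indices = [None] * 8              # sparse part: mostly-empty slots addressed by hash
entries = []                      # dense part: (hash, key, value) in insertion order

def insert(key, value):
    slot = hash(key) % len(indices)        # the real dict probes further on collision
    indices[slot] = len(entries)           # the sparse slot stores only a small integer index
    entries.append((hash(key), key, value))

def lookup(key):
    slot = indices[hash(key) % len(indices)]
    if slot is None:
        raise KeyError(key)
    h, k, v = entries[slot]
    if h == hash(key) and k == key:
        return v
    raise KeyError(key)

insert('a', 1)
insert('b', 2)
print(lookup('b'))                # 2
print(entries)                    # dense and ordered -- also why dicts iterate in insertion order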

To optimize space further, if you have many small instances of a class, their attribute dictionaries will often do 'key-sharing': only one copy of the keys is stored and shared between many dictionaries. With mutable keys, this seems harder to implement (however, I'm not 100% sure on this point: mutable keys might not be a big issue here).
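
If you want to see key-sharing at work, one implementation-dependent check is to compare an instance's __dict__ with an ordinary dict holding the same data (the exact numbers vary across CPython versions):

import sys

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
shared = p.__dict__           # instance dict; its keys are shared with other instances
plain = dict(shared)          # an ordinary dict with its own private keys

print(sys.getsizeof(shared), sys.getsizeof(plain))
# On recent CPython builds the first number is usually smaller, because the
# instance dict stores only values and points at a shared key table.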

Another aspect that isn't discussed in depth in the linked post, is more relevant to your proposal, and is apparently not super well known, is described in PEP 509 - private version numbers for dicts. A major source of optimization for CPython is caching the results of dictionary lookups. Since CPython 3.6, every dictionary carries a private version tag that is updated whenever that dictionary is created or modified (a global counter is used so tags are never reused). To decide whether a cached lookup result is still valid, the dictionary's current version tag is compared with the tag saved when the result was cached: if they match, the dictionary has not changed and the cached value is safe to use.
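
The idea, expressed as a pure-Python toy (a conceptual sketch only; the real version tag lives in the C-level dict struct and is not exposed like this):

class VersionedDict(dict):
    'toy model of PEP 509: bump a version tag on every mutation'
    _counter = 0

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._bump()

    def _bump(self):
        VersionedDict._counter += 1              # a global counter keeps tags unique
        self.version = VersionedDict._counter

    def __setitem__(self, key, value):
        super().__setitem__(key, value)
        self._bump()

    def __delitem__(self, key):
        super().__delitem__(key)
        self._bump()

d = VersionedDict(a=1)
cached_value, cached_version = d['a'], d.version

print(d.version == cached_version)   # True  -> the cached lookup is still valid
d['a'] = 2                           # mutation bumps the version tag
print(d.version == cached_version)   # False -> the cached result must be discarded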

Reason number three: CPython guarantees that dictionary methods are atomic via the Global Interpreter Lock (GIL); dictionaries are often used to pass or share data between threads and processes. Adding this protection to any possible mutations of keys, which would then become dictionary updates, is probably seen as an unacceptable performance tradeoff.
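
For example, worker threads can publish results into a shared dict under distinct keys without an explicit lock, because each individual d[key] = value is a single atomic operation under the GIL (a sketch of the pattern, not an argument against locks for anything more involved):

import threading

results = {}

def worker(n):
    results[n] = n * n        # one __setitem__ call; atomic under the GIL

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)                # {0: 0, 1: 1, 2: 4, 3: 9} -- insertion order depends on thread timing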

There are, as mentioned, many other reasons unrelated to performance that are probably even stronger concerns. Since those are better known and less CPython-specific, it's probably best to explore the existing posts and answers for the theoretical and usability reasons to want immutability by default. The short answer: many Python beginner traps have to do with mutability (changing a list changes it for every reference to it, and so on). Adding more surprising mutability to fundamental Python structures is unlikely to be seen favorably by those in the Python community who have watched hundreds of Python students make similar mistakes based on otherwise-reasonable assumptions about Python internals.
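
The classic surprise, just to illustrate the kind of mistake being referred to:

a = [1, 2, 3]
b = a               # b is another name for the same list, not a copy
b.append(4)
print(a)            # [1, 2, 3, 4] -- "changing b" changed a too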

kcsquared