Which is a more efficient hash implementation in Python objects? Hashing self.str, a tuple of attributes, or a real hash routine?

Question

It seems a common and quick way to create a stock __hash__() for any given Python object is to return hash(str(self)), if that object implements __str__(). Is this efficient, though? Per this SO answer, a hash of a tuple of the object's attributes is "good", but doesn't seem to indicate if it's the most efficient for Python. Or would it be better to implement a __hash__() for each object and use a real hashing algorithm from this page and mixup the values of the individual attributes into the final value returned by __hash__()?

Pretend I've implemented the Jenkins hash routines from this SO question. Which __hash__() would be better to use?:

# hash str(self)
def __hash__(self):
    return hash(str(self))

# hash of tuple of attributes
def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3,
                 self.attr4, self.attr5, self.attr6))

# jenkins hash
def __hash__(self):
    from jenkins import mix, final
    a = self.attr1
    b = self.attr2
    c = self.attr3
    a, b, c = mix(a, b, c)
    a += self.attr4
    b += self.attr5
    c += self.attr6
    a, b, c = final(a, b, c)
    return c

Assume the attrs in the sample object are all integers for simplicity. Also assume that all objects derive from a base class and that each objects implements its own __str__(). The tradeoff in using the first hash is that I could implement that in the base class as well and not add additional code to each of the derived objects. But if the second or third __hash__() implementations are better in some way, does that offset the cost of the added code to each derived object (because each may have different attributes)?

Edit: the import in the third __hash__() implementation is there only because I didn't want to draft out an entire example module + objects. Assume that import really happens at the top of the module, not on each invocation of the function.

Conclusion: Per the answer and comments on this closed SO question, it looks like I really want the tuple hash implementation, not for speed or efficiency, but because of the underlying duality of __hash__ and __eq__. Since a hash value is going to have a limited range of some form (be it 32 or 64 bits, for example), in the event you do have a hash collision, object equality is then checked. So since I do implement __eq__() for each object by using tuple comparison of self/other's attributes, I also want to implement __hash__() using an attribute tuple so that I respect the hash/equality nature of things.

Weren't you curious, why `tuple` is hashable and `list` is not? I am simplifying, but this may actually help you answer the question yourself. — Tadeck, Jan 18 '13 at 00:32
@Tadeck: I actually wasn't aware that tuples are hashable while lists aren't. I do use tuples of object properties to check for equality, but haven't investigated them for hash implementations yet. — Kumba, Jan 18 '13 at 00:33
I was less talking about implementation detail, more about the _reason_ it is done this way. Also please note `tuple` is not mutable, while `list` is. — Tadeck, Jan 18 '13 at 00:36
I should add, I am converting code from a VB.NET project to Python. There's no such thing as a tuple in .NET, so there, each object had to roll its own `GetHashCode()` (along with a bunch of other equality interfaces). — Kumba, Jan 18 '13 at 00:36
@Tadeck: Yes, I am aware that tuples are immutable while lists aren't. — Kumba, Jan 18 '13 at 00:37

kindall · Accepted Answer · 2013-01-18T18:30:51.103

3

Your second one has an important performance pessimization: it's importing two names each time the function is called. Of course, how performant it is relative to the string-hash version depends on how the string is generated.

That said, when you have attributes that define equality for the object, and those attributes are themselves hashable types, the simplest (and almost certainly best-performing) approach is going to be to hash a tuple containing those attribute values.

def __hash__(self):
    return hash((self.attr1, self.attr2, self.attr3))

edited Jan 18 '13 at 18:30

answered Jan 18 '13 at 00:42

kindall

178,883
35
278
309

Yeah, I know about the import overhead. I didn't feel like drafting out an entire example module + objects for the example, so I stuck it there. Let me edit the question to annotate that. – Kumba Jan 18 '13 at 00:44
Regarding the tuple approach, I did read about that in [this SO answer](http://stackoverflow.com/a/4005412), however, the creator of that answer simply said it was "good", and the tone implies there might be better ways. A search on SO doesn't turn up (within the first few pages) questions addressing this specific topic, so I thought I'd ask. Guess I should add the tuple example to the question, too? – Kumba Jan 18 '13 at 00:48

Which is a more efficient __hash__ implementation in Python objects? Hashing self.__str__, a tuple of attributes, or a real hash routine?

1 Answers1

Which is a more efficient hash implementation in Python objects? Hashing self.str, a tuple of attributes, or a real hash routine?