5

Following this question, we know that two different dictionaries, dict_1 and dict_2 for example, use the exact same hash function.

Is there any way to alter the hash function used by the dictionary?Negative answers also accepted!

Community
  • 1
  • 1
gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • 2
    Hey, nice formatting! I didn't knew that you can use a tag or sub-placed text in an post... – linusg May 08 '16 at 19:47
  • Same before someone taught me @linusg! Use [...tag: ...Python] without the dots and enclose text in sub and sup tags to get the result. Click edit on my question to see exactly how is this done! – gsamaras May 08 '16 at 19:48
  • 1
    I narrowed down the question @ReutSharabani. – gsamaras May 08 '16 at 20:01
  • May I ask: Why do you want to alter the hash function? – Tadhg McDonald-Jensen May 08 '16 at 20:04
  • 1
    @TadhgMcDonald-Jensen sure. There was an attempt to explain that before my edit. Now I would like not to edit again and go back. I want that for this reason: http://stackoverflow.com/questions/37089971/confusion-in-hashing-in-lsh – gsamaras May 08 '16 at 20:06
  • I'm guessing that storing the data based on it's `id` wouldn't be sufficient? – Tadhg McDonald-Jensen May 08 '16 at 20:35

3 Answers3

5

You can't change the hash-function - the dict will call hash on the keys it's supposed to insert, and that's that.

However, you can wrap the keys to provide different __hash__ and __eq__-Methods.

class MyHash(object):
     def __init__(self, v):
         self._v = v

     def __hash__(self):
         return hash(self._v) * -1

     def __eq__(self, other):
         return self._v == other._v

If this actually helps anything with your original problem/question I doubt though, it seems rather a custom array/list-based data-structure might be the answer. Or not.

deets
  • 6,285
  • 29
  • 28
  • what do you mean by "doesn't require me to implement `__eq__`."? – Tadhg McDonald-Jensen May 08 '16 at 20:25
  • when I do `my_dict = {} ; my_dict[MyHash("foo")] = 4 ; print(my_dict[MyHash("foo")])` I get a KeyError, I think you still need to implement `__eq__` for this to work correctly. – Tadhg McDonald-Jensen May 08 '16 at 20:27
  • @TadhgMcDonald-Jensen my comment was actually referring to the equality of the _v being still a match for the simple modification of `__hash__`, but that doesn't change that I have to forward to the original `__eq__`. Thanks for pointing that out. – deets May 08 '16 at 20:31
2

Here is a "hash table" on top of a list of lists, where each hash table object is associated with a particular hashing function.

class HashTable(object):
    def __init__(self, hash_function, size=256):
        self.hash_function = hash_function
        self.buckets = [list() for i in range(size)]
        self.size = size

    def __getitem__(self, key):
        hash_value = self.hash_function(key) % self.size
        bucket = self.buckets[hash_value]
        for stored_key, stored_value in bucket:
            if stored_key == key:
                return stored_value
        raise KeyError(key)


    def __setitem__(self, key, value):
        hash_value = self.hash_function(key) % self.size
        bucket = self.buckets[hash_value]
        # if the key is in the bucket, replace the value and return
        for i, (stored_key, stored_value) in enumerate(bucket):
            if stored_key == key:
                 bucket[i] = (key, value)
                 return
        # otherwise append the key value pair to the bucket
        bucket.append((key, value))

The rest of your application can still see the underlying list of buckets. Your application might require additional metadata to be associated with each bucket, but that would be as simple as defining a new class for the elements of the bucket list instead of a plain list.

TheLizzard
  • 7,248
  • 2
  • 11
  • 31
mobiusklein
  • 1,403
  • 9
  • 12
  • Why this got a downvote? I am going to upvote since there is no justification here. :) – gsamaras May 08 '16 at 20:11
  • It might have been because I had initially forgotten a to increment `i` in the `__setitem__` for loop, which it took me 40 seconds to notice, or that I didn't comment a single line of code. Is the class definition self evident, or should I document more? – mobiusklein May 08 '16 at 20:24
  • Your answer, your choice. :) The length is not a must for a good post. For example my [best answer](https://stackoverflow.com/questions/26716255/why-does-this-program-print-forked-4-times/26716300#26716300) was quite small at first, but then I felt like expanding it, worth it! My [best question](https://stackoverflow.com/questions/30614396/what-does-i-i-i-1-1-do) is short (relatively). – gsamaras May 09 '16 at 07:03
1

I think what you want is a way to create buckets. Based on this I recommend collections.defaultdict with a set initializer as the "bucket" (depends on what you're using it for though).

Here is a sample:

#!/usr/bin/env python

from collections import defaultdict
from itertools import combinations

d = defaultdict(set)

strs = ["str", "abc", "rts"]
for s in strs:
    d[hash(s)].add(s)
    d[hash(''.join(reversed(s)))].add(s)

for combination in combinations(d.values(), r=2):
    matches = combination[0] & combination[1]
    if len(matches) > 1:
        print matches

# output: set(['str', 'rts'])

Two strings ending up in the same buckets here are very likely the same. I've created a hash collision by using the reverse function and using a string and it's reverse as values.

Note that the set will use full comparison but should do it very fast.

Don't hash too many values without draining the sets.

Reut Sharabani
  • 30,449
  • 6
  • 70
  • 88