
I am looking for a hash-function family generator that can generate a family of hash functions given a set of parameters. I haven't found any such generator so far. Is there a way to do that with the hashlib package?

For example, I'd like to do something like:

h1 = hash_function(1)
h2 = hash_function(2)
...

and h1 and h2 would be different hash functions.

For those of you who might know about it, I am trying to implement a min-hashing algorithm on a very large dataset.

Basically, I have a very large set of features (100 million to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations for this set of features.

I do NOT want to build the random permutations explicitly, so the technique I would like to use is the following:

  1. generate a hash function h and consider that, for two indices r and s, r appears before s in the permutation if h(r) < h(s);
  2. do that for 100 to 1000 different hash functions.
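The two steps above can be sketched in a few lines (a toy illustration only; `make_hash` is a hypothetical stand-in for whichever hash family ends up being used):

```python
import random

def make_hash(seed):
    # Hypothetical member of a simple (a*x + b) mod p family,
    # parameterized by an integer seed.
    rnd = random.Random(seed)
    p = 2**31 - 1  # a Mersenne prime
    a = rnd.randrange(1, p)
    b = rnd.randrange(0, p)
    return lambda x: (a * x + b) % p

features = [3, 14, 15, 92, 65]

# Each hash function induces an implicit permutation of the features:
# r comes before s in the permutation iff h(r) < h(s).
for seed in range(3):
    h = make_hash(seed)
    print(sorted(features, key=h))
```

Sorting by h is only shown to make the permutation visible; for min-hashing you would just take `min(h(e) for e in features)` per hash function, never materializing the permutation.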

Are there any known libraries that I might have missed? Or any standard way of generating families of hash functions with Python that you might be aware of?

JJJ
Nicolas M.

5 Answers


I'd just do something like this (if you don't need thread safety -- not hard to alter if you DO need it -- and assuming a 32-bit Python version):

import random

_memomask = {}

def hash_function(n):
    mask = _memomask.get(n)
    if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
    def myhash(x):
        return hash(x) ^ mask
    return myhash
Alex Martelli
    Thanks for this answer. It seems to work great. Any particular for using those type of hash functions ? efficiency ? will yield very different approximate permutations in some sense ? – Nicolas M. Feb 12 '10 at 23:30
  • The built-in `hash` is decent and pretty efficient -- xor'ing it with a number depending (but in a sufficiently chaotic way) from the index within the family just seems another decent/efficient way to turn that one hash function into a family. If speed's not a problem you could use stronger (crypto-quality) hashing, I guess -- that will presumably give you higher quality (neither hash nor random are crypto-quality and thus neither's their XOR;-) but the speed impact is REALLY large (orders of magnitude...). – Alex Martelli Feb 13 '10 at 00:07
  • Thanks. Actually, I believe that speed will be key for me here. The only "quality" I am looking for is that the hash functions will generate "as different" random permutations as possible ( I am not sure how to quantify this though...) by the process I described in my original question. Again, thanks a lot for your great answer. – Nicolas M. Feb 13 '10 at 00:15
  • This does not work and is a very poor choice for almost every use of hash families. If you intend to use this for hash tables where you probe multiple locations based on hashs (cuckoo, 2 way hash, etc..) then this is an _extremely_ bad choice and no different than using a single function. the entire point of using different hash functions is that a different collision pattern will happen, when you xor the output of your hash with a constant then it does not change the collisions at all, the same keys that collide in one will collide in another. – John Meacham Apr 30 '21 at 05:10
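The point in the last comment is easy to demonstrate (a sketch assuming 64-bit CPython, where integer hashing is done modulo `sys.hash_info.modulus = 2**61 - 1`; the modulus `p` below is an arbitrary illustrative choice):

```python
import sys
import random

# Two integers that collide under the built-in hash:
M = sys.hash_info.modulus  # 2**61 - 1 on 64-bit CPython
x, y = 1, M + 1
print(hash(x) == hash(y))  # True: a built-in collision

# XOR-ing with a per-function mask preserves every such collision:
random.seed(42)
mask = random.getrandbits(32)
print(hash(x) ^ mask == hash(y) ^ mask)  # still True

# An (a*x + b) mod p family, by contrast, separates these two keys
# (2**31 - 1 is coprime to M, so x and y differ mod p):
p = 2**31 - 1
a, b = random.randrange(1, p), random.randrange(p)
h = lambda v: (a * v + b) % p
print(h(x) == h(y))  # False for any valid a
```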

As mentioned above, you can use universal hashing for minhash. For example:

import random



def minhash():
    d1 = set(random.randint(0, 2000) for _ in range(1000))
    d2 = set(random.randint(0, 2000) for _ in range(1000))
    jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2))
    print("jaccard similarity: {}".format(jacc_sim))

    N_HASHES = 200
    hash_funcs = []
    for i in range(N_HASHES):
        hash_funcs.append(universal_hashing())

    m1 = [min([h(e) for e in d1]) for h in hash_funcs]
    m2 = [min([h(e) for e in d2]) for h in hash_funcs]
    minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES
    print("min-hash similarity: {}".format(minhash_sim))



def universal_hashing():
    def rand_prime():
        while True:
            # sample odd candidates only; starting at the even bound 2**32
            # would let even numbers slip through the odd-divisor test below
            p = random.randrange(2 ** 32 + 1, 2 ** 34, 2)
            if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)):
                return p
    m = 2 ** 32 - 1
    p = rand_prime()
    a = random.randint(1, p - 2)  # a must be nonzero mod p
    if a % 2 == 0:
        a += 1
    b = random.randint(0, p - 1)
    def h(x):
        return ((a * x + b) % p) % m
    return h

Reference

k3f9f2kf2
  • Tried to edit your answer but it must have been more than 6 chars. There is a syntax error; correct it to: `minhash_sim = sum([int(m1[i] == m2[i]) for i in range(N_HASHES)]) / N_HASHES` – Massoud Jan 21 '19 at 17:06

@alex's answer is great and concise, but the hash functions it generates are not "very different from each other".

Let's look at the Pearson correlation between 10000 samples of 10000 hashes, with the results put into 100 bins:

%%time # 1min 14s
n=10000
hashes = [hash_function(i) for i in range(n)]
median_pvalue(hashes, n=n)
# 1.1614081043690444e-06

I.e. the median p-value is 1e-06, which is far from random. Here's an example if it were truly random:

%%time # 4min 15s
hashes = [lambda _ : random.randint(0,100) for _ in range(n)]
median_pvalue(hashes, n=n)
# 0.4979718236429698

Using the Carter and Wegman method you could get:

%%time # 1min 43s
hashes = HashFamily(100).draw_hashes(n)
median_pvalue(hashes, n=n)
# 0.841929288037321

Code to reproduce:


from scipy.stats.stats import pearsonr 
import numpy as np
import random

_memomask = {}

def hash_function(n):
    mask = _memomask.get(n)
    if mask is None:
        random.seed(n)
        mask = _memomask[n] = random.getrandbits(32)
    def myhash(x):
        return hash(x) ^ mask
    return myhash

class HashFamily():
    r"""Universal hash family as proposed by Carter and Wegman.
    .. math::
            \begin{array}{ll}
            h_{{a,b}}(x)=((ax+b)~{\bmod  ~}p)~{\bmod  ~}m \ \mid p > m\\
            \end{array}
    Args:
        bins (int): Number of bins to hash to. Better if a prime number.
        moduler (int,optional): Temporary hashing. Has to be a prime number.
    """
    def __init__(self, bins, moduler=None):
        if moduler and moduler <= bins:
            raise ValueError("p (moduler) should be >> m (buckets)")

        self.bins = bins
        self.moduler = moduler if moduler else self._next_prime(np.random.randint(self.bins + 1, 2**32))

        # do not allow same a and b, as it could mean shifted hashes
        self.sampled_a = set()
        self.sampled_b = set()

    def _is_prime(self, x):
        """Naive primality test."""
        for i in range(2, int(np.sqrt(x)) + 1):  # + 1 so the square root itself is tested
            if x % i == 0:
                return False
        return True

    def _next_prime(self, n):
        """Naively gets the next prime larger than n."""
        while not self._is_prime(n):
            n += 1

        return n

    def draw_hash(self, a=None, b=None):
        """Draws a single hash function from the family."""
        if a is None:
            while a is None or a in self.sampled_a:
                a = np.random.randint(1, self.moduler - 1)
                assert len(self.sampled_a) < self.moduler - 2, "please give a bigger moduler"

            self.sampled_a.add(a)
        if b is None:
            while b is None or b in self.sampled_b:
                b = np.random.randint(0, self.moduler - 1)
                assert len(self.sampled_b) < self.moduler - 1, "please give a bigger moduler"

            self.sampled_b.add(b)

        return lambda x: ((a * x + b) % self.moduler) % self.bins

    def draw_hashes(self, n, **kwargs):
        """Draws n hash function from the family."""
        return [self.draw_hash() for i in range(n)]

def median_pvalue(hashes, buckets=100, n=1000):
    p_values = []
    for j in range(n-1):
        a = [hashes[j](i) % buckets for i in range(n)]
        b = [hashes[j+1](i) % buckets for i in range(n)]
        p_values.append(pearsonr(a,b)[1])
    return np.median(p_values)

Note that my implementation of Carter and Wegman is very naive (e.g. the generation of prime numbers). It could be made shorter and quicker.
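For instance, the trial-division search in `_next_prime` could be replaced by a Miller-Rabin test (a sketch; the fixed base set below is known to be deterministic for all inputs up to about 3.3e24, which comfortably covers the 32-bit moduli used here):

```python
def is_probable_prime(n):
    # Miller-Rabin with the first 12 primes as bases; deterministic
    # for n < 3,317,044,064,679,887,385,961,981.
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for b in bases:
        if n % b == 0:
            return n == b
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def next_prime(n):
    # Smallest prime >= n.
    while not is_probable_prime(n):
        n += 1
    return n
```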

Yann Dubois

You should consider using universal hashing. My answer and code can be found here: https://stackoverflow.com/a/25104050/207661

Shital Shah

A universal hash family is a set H of hash functions mapping into a range of size m, such that any two (distinct) inputs collide with probability at most 1/m when the hash function h is drawn uniformly at random from H.

universal hash family

Based on the formulation in Wikipedia, you can use the following code:

import random

def is_prime(n):
    if n==2 or n==3: return True
    if n%2==0 or n<2: return False
    for i in range(3, int(n**0.5)+1, 2):
        if n%i==0:
            return False    
    return True

# universal hash functions
class UniversalHashFamily:
    def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
        self.number_of_buckets = number_of_buckets
        self.bucket_value_offset = bucket_value_offset
        
        primes = []
        number_to_check = min_value_for_prime_number
        while len(primes) < number_of_hash_functions:
            if is_prime(number_to_check):
                primes.append(number_to_check)
            number_to_check += random.randint(1, 1000)

        self.hash_function_attrs = []
        for i in range(number_of_hash_functions):
            p = primes[i]
            a = random.randint(1, p - 1)  # a must be in [1, p-1]
            b = random.randint(0, p - 1)  # b must be in [0, p-1]
            self.hash_function_attrs.append((a, b, p))
    
    def __call__(self, function_index, input_integer):
        a, b, p = self.hash_function_attrs[function_index]
        return (((a*input_integer + b)%p)%self.number_of_buckets) + self.bucket_value_offset

Example usage:

We can create a hash family consisting of 20 hash functions, each one mapping the input to 100 buckets.

hash_family = UniversalHashFamily(20, 100)

And get the hashed values like:

input_integer = 1234567890 # sample input

hash_family(0, input_integer) # the output of the first hash function, i.e. h0(input_integer)
hash_family(1, input_integer) # the output of the second hash function, i.e. h1(input_integer)
# ...
hash_family(19, input_integer) # the output of the last hash function, i.e. h19(input_integer)

If you are interested in a universal hash family for string inputs, you can use the following code. But please note that it may not be an optimized solution for string hashing.

class UniversalStringHashFamily:
    def __init__(self, number_of_hash_functions, number_of_buckets, min_value_for_prime_number=2, bucket_value_offset=0):
        self.number_of_buckets = number_of_buckets
        self.bucket_value_offset = bucket_value_offset
        
        primes = []
        number_to_check = max(min_value_for_prime_number, number_of_buckets)
        while len(primes) < number_of_hash_functions:
            if is_prime(number_to_check):
                primes.append(number_to_check)
            number_to_check += random.randint(1, 1000)

        self.hash_function_attrs = []
        for i in range(number_of_hash_functions):
            p = primes[i]
            a = random.randint(1, p - 1)   # must be nonzero mod p
            a2 = random.randint(1, p - 1)  # must be nonzero mod p
            b = random.randint(0, p - 1)
            self.hash_function_attrs.append((a, b, p, a2))
    
    def hash_int(self, int_to_hash, a, b, p):
        return (((a*int_to_hash + b)%p)%self.number_of_buckets) + self.bucket_value_offset
        
    def hash_str(self, str_to_hash, a, b, p, a2):
        str_to_hash = "1" + str_to_hash # this will ensure that universality is not affected, see wikipedia for more detail
        l = len(str_to_hash)-1
        
        int_to_hash = 0
        for i in range(l+1):
            int_to_hash += ord(str_to_hash[i]) * (a2 ** (l-i))
        int_to_hash = int_to_hash % p
        return self.hash_int(int_to_hash, a, b, p)
    
    def __call__(self, function_index, str_to_hash):
        a, b, p, a2 = self.hash_function_attrs[function_index]
        return self.hash_str(str_to_hash, a, b, p, a2)
Mohsenasm