3

I am writing a method to generate cache keys for caching function results, the key is based on a combination of function name and hash value of parameters.

Currently I am using hashlib to hash the serialized version of parameters, however the operation is very expensive to serialize large objects, so what's the alternative?

#get the cache key for storage
def cache_get_key(*args):
    import hashlib
    serialise = []
    for arg in args:
        serialise.append(str(arg))
    key = hashlib.md5("".join(serialise)).hexdigest()
    return key

UPDATE: I tried using hash(str(args)), but if args have relatively large data in it, still takes long time to compute the hash value. Any better way to do it?

Actually str(args) with large data takes forever...

Joe Doyle
  • 6,363
  • 3
  • 42
  • 45
James Lin
  • 25,028
  • 36
  • 133
  • 233
  • Your example doesn't align with your question. You're not hashing an object, you're hashing several objects. – Tim McNamara Apr 10 '12 at 22:01
  • 1
    Curious, why don't you just use a dictionary with function names keys to dictionarys with input arguments as keys to results? – 8bitwide Apr 10 '12 at 22:02
  • @TimMcNamara the current way right now I need to serialize all objects to string, then md5 it. – James Lin Apr 10 '12 at 22:04
  • @JamesLin At the very least, you probably want a nonempty string to use in `join` using non-hex characters. Consider the hashes of `("foo", "bar")` and `("fo", "obar")`. – Michael Mior Apr 10 '12 at 22:06
  • @8bitwide could you please elaborate a bit more? – James Lin Apr 10 '12 at 22:10
  • 1
    I might be confused about the question, so correct me if I'm way off base.The python dictionary is basically as hash with a lookup run time of more or less O(1). So make a dictionay { Functionname key:{}}. a dictionary of dictionaries. You index a dictionary by function name. That dictionary is {(*args tuple):function results}. A dictionary of results for that function, indexed by the arguments passed in. – 8bitwide Apr 10 '12 at 22:20
  • @8bitwide, I implemented what I think you were suggesting in my answer below (http://stackoverflow.com/a/10097150/10608). Is that generally what you were thinking? – Alexander Bird Apr 10 '12 at 22:31
  • @Thr4wn pretty much except use dictionaries {} instead of a lists – 8bitwide Apr 10 '12 at 22:35
  • What kind of cache are you doing? The goal is to be able to look up the result given (function name, args), yeah? So... I would just use an ordinary Python `dict`, which does its own hashing automatically. – Karl Knechtel Apr 10 '12 at 23:39
  • @KarlKnechtel I am doing memcache caching. – James Lin Apr 11 '12 at 02:29
  • @8bitwide I see what you are saying, I am implementing a cache decorator which is similar to django-cache-util, and primarily being used in django code, if I do it the way you mentioned, for every access to the cached value, I will need to get the whole collection of hashtable due to the way how http process works. – James Lin Apr 11 '12 at 02:32

5 Answers5

1

Have you tried just using the hash function? It works perfectly well on tuples.

Marcin
  • 48,559
  • 18
  • 128
  • 201
1

Assuming you made the object, and it is composed of smaller components (it is not a binary blob), you can precompute the hash when you build the object by using the hashes of its subcomponents.

For example, rather than serialize(repr(arg)), do arg.precomputedHash if isinstance(arg, ...) else serialize(repr(arg))

If you neither make your own objects nor use hashable objects, you can perhaps keep a memoization table of objectreferences -> hashes, assuming you don't mutate the objects. Worst case, you can use a functional language which allows for memoization, since all objects in such a language are probably immutable and hence hashable.

ninjagecko
  • 88,546
  • 24
  • 137
  • 145
  • how could you precompute hash from the smaller hashes? You would still have to dynamically combine those smaller hashes and hash them. You can't precompute that, right? Or am I missing something? – Alexander Bird Apr 10 '12 at 22:09
  • @Thr4wn: Suppose we called these objects `MyObj`. Then `MyObj(MyObj(foo, bar, baz), MyObj(bing, biz))` would only require 1 hash at the time you combine the two `MyObj`s into the larger `MyObj`. You can precompute the hashes of the smaller objects ahead of time. Even if you didn't have this structure, the point is that at some point you needed to read all that data in an O(datasize) operator. Therefore you should be able to do the hashing at object-creation/reading time rather than at the last minute. – ninjagecko Apr 10 '12 at 22:21
1
def cache_get_key(*args):
    return hash(str(args))

or (if you really want to use the hashlib library)

def cache_get_key(*args):
    return hashlib.md5(str(args)).hexdigest()

I wouldn't bother rewriting code to make arrays into strings. Use the inbuilt one.

alternative solution

below is the solution @8bitwide suggested. No hashing required at all with this solution!

def foo(x, y):
    return x+y+1

result1 = foo(1,1)
result2 = foo(2,3)

results = {}
results[foo] = {}
results[foo][ [1,1] ] = result1
results[foo][ [2,3] ] = result2
Alexander Bird
  • 38,679
  • 42
  • 124
  • 159
1

I know this question is old, but I just want to add my 2 cents:

  1. You don't have to create a list and then join it. Especially if the list is going to be discarded anyways. Use the hash function's .update() method

  2. Consider using a much faster non-cryptographic hash algo, especially if this is not meant to be a cryptographically-secure implementation.

Having said that, this is my suggested improvement:

import xxhash

#get the cache key for storage
def cache_get_key(*args):
    hasher = xxhash.xxh3_64()
    for arg in args:
        hasher.update(str(arg))
    return hasher.hexdigest()

This uses the (claimed to be) extremely fast xxHash NCHF*.


* NCHF = Non-Cryptographic Hash Function

pepoluan
  • 6,132
  • 4
  • 46
  • 76
  • str(arg) is not complete. e.g., in a large list it just contains few items – Ali Sep 23 '22 at 15:58
  • @Ali I am not sure I understand what you meant there. Do note that `arg` is iterated over `args` so it _will_ catch every single item in `args`. – pepoluan Sep 26 '22 at 07:10
  • assume that one of the arguments is numpy array, therefore the str function will not return all the elements. – Ali Sep 27 '22 at 12:29
  • @Ali I tried `import numpy as np; print(str(np.array([1, 2, 3, 4, 5, 6])))` it prints out `[1 2 3 4 5 6]`. I still see no problem. – pepoluan Oct 04 '22 at 08:59
  • @Ali Try this also: `a = np.array([1, 2, 3]); b = [11, a, 12]; c = [str(i) for i in b]; print(c)` you'll get `['11', '[1 2 3]', '12']`, so as you can see iterating over the list `b` and getting each element's `str()` works. – pepoluan Oct 04 '22 at 09:08
  • please try: `import numpy as np; print(str(np.array(np.ones((100,100,100)))))` to see the result – Ali Oct 04 '22 at 16:00
0

I've seen people feed an arbitrary python object to random.seed(), and then use the first value back from random.random() as the "hash" value. It doesn't give a terrific distribution of values (can be skewed), but it seems to work for arbitrary objects.

If you don't need cryptographic-strength hashes, I came up with a pair of hash functions for a list of integers that I use in a bloom filter. They appear below. The bloom filter actually uses linear combinations of these two hash functions to obtain an arbitrarily large number of hash functions, but they should work fine in other contexts that just need a bit of scattering with a decent distribution. They're inspired by Knuth's writing on Linear Congruential Random Number Generation. They take a list of integers as input, which I believe could just be the ord()'s of your serialized characters.

MERSENNES1 = [ 2 ** x - 1 for x in [ 17, 31, 127 ] ]
MERSENNES2 = [ 2 ** x - 1 for x in [ 19, 67, 257 ] ]


def simple_hash(int_list, prime1, prime2, prime3):
    '''Compute a hash value from a list of integers and 3 primes'''
    result = 0
    for integer in int_list:
        result += ((result + integer + prime1) * prime2) % prime3
    return result


def hash1(int_list):
    '''Basic hash function #1'''
    return simple_hash(int_list, MERSENNES1[0], MERSENNES1[1], MERSENNES1[2])


def hash2(int_list):
    '''Basic hash function #2'''
    return simple_hash(int_list, MERSENNES2[0], MERSENNES2[1], MERSENNES2[2])
user1277476
  • 2,871
  • 12
  • 10