
Let's say I have a referentially transparent function. It is very easy to memoize it; for example:

import functools

kwd_mark = object()  # sentinel separating positional args from keyword args in the key

def memoize(obj):
  memo = {}
  @functools.wraps(obj)
  def memoizer(*args, **kwargs):
    combined_args = args + (kwd_mark,) + tuple(sorted(kwargs.items()))
    if combined_args not in memo:
      memo[combined_args] = obj(*args, **kwargs)
    return memo[combined_args]
  return memoizer

@memoize
def my_function(data, alpha, beta):
  # ...

Now suppose that the data argument to my_function is huge; say, it's a frozenset with millions of elements. In this case, the cost of memoization is prohibitive: every time, we'd have to calculate hash(data) as part of the dictionary lookup.

I could make the memo dictionary an attribute of data instead of an object inside the memoize decorator. That way I could skip the data argument entirely during the cache lookup, since the chance that another huge frozenset will be equal to it is negligible. However, this approach pollutes the argument passed to my_function. Worse, if I have two or more large arguments, it doesn't help at all (I can only attach memo to one argument).
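For concreteness, the per-data memo idea above could be sketched like this. Note that a plain frozenset rejects attribute assignment, so this assumes data is a subclass (or otherwise accepts attributes); memoize_on_data and _memo are illustrative names, not part of any library:

```python
import functools

def memoize_on_data(func):
  # Sketch of the per-argument approach: the memo lives on the `data`
  # object itself, keyed only by the remaining arguments, so `data` is
  # never hashed or compared during the lookup.
  @functools.wraps(func)
  def wrapper(data, *args, **kwargs):
    try:
      memo = data._memo
    except AttributeError:
      memo = data._memo = {}  # fails for a plain frozenset; needs a subclass
    key = args + tuple(sorted(kwargs.items()))
    if key not in memo:
      memo[key] = func(data, *args, **kwargs)
    return memo[key]
  return wrapper
```

A usage example would be a frozenset subclass such as `class MyData(frozenset): pass`, whose instances accept the `_memo` attribute.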

Is there anything else that can be done?

max
  • When you define `my_function`, do you know that you always want to skip memoization of `data`? – Janne Karila Oct 04 '12 at 11:20
  • I don't want to skip it. But if I make the `memo` object an attribute of the `data` parameter passed to me, I don't need to include it in the key any longer: after all, by construction a separate `memo` will be used for each `data` object! – max Oct 04 '12 at 11:23

2 Answers


Well, you can use hash there without fear. A frozenset's hash is not calculated more than once by Python - only when it is created - check the timings:

>>> timeit("frozenset(a)", "a=range(100)")
3.26825213432312
>>> timeit("hash(a)", "a=frozenset(range(100))")
0.08160710334777832
>>> timeit("(lambda x:x)(a)", "a=hash(frozenset(range(100)))")
0.1994171142578125

Don't forget that Python's hash builtin calls the object's __hash__ method, which has its return value defined at creation time for built-in hashable objects. Above you can see that calling an identity lambda function is more than twice as slow as calling hash(a).

So, if all your arguments are hashable, just add their hash when creating combined_args - otherwise, write the key construction so that it uses hash for frozenset (and perhaps other) types, with a conditional.
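The suggestion above could be sketched as a variant of the question's memoizer that keys the cache on the arguments' hashes instead of the arguments themselves (memoize_by_hash is an illustrative name; note this trades strict correctness for speed, since two distinct arguments with equal hashes would collide - see the comments below):

```python
import functools

def memoize_by_hash(func):
  # Key the memo on hash(arg) for every argument, so a huge frozenset is
  # hashed (cheap after the first time) but never compared element by element.
  memo = {}
  @functools.wraps(func)
  def wrapper(*args, **kwargs):
    key = (tuple(hash(a) for a in args),
           tuple(sorted((k, hash(v)) for k, v in kwargs.items())))
    if key not in memo:
      memo[key] = func(*args, **kwargs)
    return memo[key]
  return wrapper
```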

jsbueno
    While the hash value is indeed not calculated more than once, it is not calculated during creation. And for a large object it's nearly as expensive as the object creation. But I agree, `hash` is not the problem. It seems `__eq__` is a much bigger problem. – max Oct 04 '12 at 12:01
  • The performance impact is only visible when you try `frozenset(range(100000000))` or something like that. – max Oct 04 '12 at 12:11
  • Hmm if hash value is not calculated more than once, how come python's own `lru_cache` implementation [created a special class](https://github.com/python/cpython/blob/da152dadf5384e47063630bc54bb664899ffa988/Lib/functools.py#L347-L352) just for the purpose of caching the hash value of tuples? – max Aug 05 '16 at 20:33

It turns out that the built-in __hash__ is not that bad, since it caches its own value after the first calculation. The real performance hit comes from the built-in __eq__, since it doesn't short-circuit on identical objects, and actually goes through the full comparison every time, making it very costly.

One approach I thought of is to subclass the built-in class for all large arguments:

class MyFrozenSet(frozenset):
  __eq__ = lambda self, other: id(self) == id(other)
  __hash__ = lambda self: id(self)

This way, dictionary lookup would be instantaneous. But the equality for the new class will be broken.

A better solution is probably this: only when the dictionary lookup is performed, wrap the large arguments inside a special class that redefines __eq__ and __hash__ to return the wrapped object's id(). The obvious implementation of the wrapper is a bit annoying, since it requires copying all the standard frozenset methods. Perhaps deriving it from the relevant ABC class would make it easier.
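As the comments below point out, if the wrapped objects are used only as dictionary keys, the wrapper needs nothing beyond __eq__ and __hash__ - no frozenset methods at all. A minimal sketch (IdentityWrapper is an illustrative name; hashing id() rather than returning it raw follows the suggestion in the comments):

```python
class IdentityWrapper:
  # Compares and hashes by the identity of the wrapped object, so equality
  # checks never walk a huge frozenset. Only valid while the wrapped object
  # stays alive (id() values can be reused after garbage collection).
  __slots__ = ('value',)

  def __init__(self, value):
    self.value = value

  def __eq__(self, other):
    return isinstance(other, IdentityWrapper) and self.value is other.value

  def __hash__(self):
    # hash(id(...)) rather than raw id() for better distribution.
    return hash(id(self.value))
```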

max
  • Why not just store the argument's hash? – jsbueno Oct 04 '12 at 12:02
  • Because then you have to be absolutely sure that you won't end up with two different objects with the same hash. I agree the chances of this are nearly negligible, but it is strictly speaking unsafe. – max Oct 04 '12 at 12:06
  • Don't subclass, *wrap*. Subclassing should (to first approximation) obey the [Liskov substitution principle](https://en.wikipedia.org/wiki/Liskov_substitution_principle), and your version certainly doesn't. Even in cases where it does hold, wrapping with a shared API is often better. Also, use `def __hash__(self): return hash(id(self.value))`: that extra `hash` is important for distribution. – Veedrac Aug 05 '16 at 15:06
  • @Veedrac with wrapped approach, which methods would I need to define in the wrapper class? – max Aug 05 '16 at 23:04
  • @max `__eq__` and `__hash__`. – Veedrac Aug 05 '16 at 23:56
  • @Veedrac I think you're right, I'm not sure why I thought I'd need to copy any other methods, if the only way the wrapped objects are used is to be stored and looked up in the memoization dictionary. – max Aug 06 '16 at 18:23