
I am trying to implement the HyperLogLog counting algorithm using stochastic averaging. To do that, I need many independent universal hash functions so I can hash items into different substreams.

I found that there are only a few hash functions available in hashlib, and there seems to be no way for me to provide a seed or anything like that. I am thinking of using a different salt for each substream.
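The salting idea can be sketched with hashlib directly, since each of its hash functions accepts arbitrary bytes: prepend a per-substream salt to the item before hashing. (A sketch only; the name `salted_hash` and the choice of SHA-1 are illustrative, not a standard API.)

```python
import hashlib

def salted_hash(item: str, salt: int) -> int:
    # Prefix the item with a 4-byte salt so each salt value yields a
    # different (effectively independent) hash of the same item.
    data = salt.to_bytes(4, "big") + item.encode("utf-8")
    # Take the first 8 bytes of the SHA-1 digest as a 64-bit integer.
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")
```

`salted_hash("apple", 0)` and `salted_hash("apple", 1)` then behave like two unrelated hash functions applied to the same item.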

Louis Kuang
  • I'm no expert, but since there's going to be collisions anyway can't you just add the salt post-hashing, i.e. to the hash itself? Not sure what you mean by "independent", what the actual requirement/expectation is there. – unwind Apr 20 '16 at 08:34
  • @unwind If I were to use a salt, what library functions should I use? I could not find one. – Louis Kuang Apr 20 '16 at 08:39
  • 1
    Sorry, library recommendations are off-topic on Stack Overflow. But anyway... the hashlib functions are [cryptographic hash functions](https://en.wikipedia.org/wiki/Cryptographic_hash_function), they _can_ be used for making hash tables, etc, but they are relatively slow. Perhaps you could do something with Python's built-in `hash()` function combined with the `h(a,b,x) = (a*x+b) % p % m` formula from the Wikipedia article on [universal hashing](https://en.wikipedia.org/wiki/Universal_hashing#Hashing_integers). – PM 2Ring Apr 20 '16 at 08:52
  • But I am trying to hash a string object. I thought the built-in `hash()` could only hash int objects. – Louis Kuang Apr 20 '16 at 08:55
  • @LouisKwong Did you try using `hash` on a string? – J Richard Snape Apr 20 '16 at 09:03
  • Built-in `hash()` can hash strings or any other immutable built-in type. It's used internally to hash `dict` keys and `set` items. `hash()` can also hash custom objects, but in that case it simply invokes the object's `__hash__` method. – PM 2Ring Apr 20 '16 at 09:03
  • @PM2Ring I'd stay away from `hash()` if the hyperloglog sets need to be shareable among implementations – Antti Haapala -- Слава Україні Apr 20 '16 at 11:27
  • @AnttiHaapala: Fair point. But the hash will be reproducible if an explicit `PYTHONHASHSEED` is supplied, won't it? – PM 2Ring Apr 20 '16 at 11:31
  • 2
    No. then there is the issue of 2 vs 3, and that a new python version could use a different algorithm, and 32 vs 64-bits – Antti Haapala -- Слава Україні Apr 20 '16 at 11:32
  • Removed the resource request part from the question; it is not appropriate to ask for library recommendations here. – Antti Haapala -- Слава Україні Apr 20 '16 at 11:45
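The `h(a,b,x) = (a*x+b) % p % m` universal-hashing construction mentioned in the comments can be combined with the built-in `hash()` roughly like this (a sketch; the names are mine, and note the caveat above that `hash()` is not stable across Python versions or runs unless `PYTHONHASHSEED` is fixed):

```python
import random

P = (1 << 89) - 1   # a Mersenne prime larger than any 64-bit hash value
M = 1 << 32         # size of the output range

def make_universal_hash(seed: int):
    # Draw the (a, b) parameters of one member of the universal family.
    rng = random.Random(seed)
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)

    def h(x) -> int:
        # Reduce the object to a 64-bit integer via the built-in hash(),
        # then apply the Carter-Wegman formula (a*x + b) % p % m.
        xi = hash(x) & 0xFFFFFFFFFFFFFFFF
        return (a * xi + b) % P % M

    return h
```

`make_universal_hash(0)`, `make_universal_hash(1)`, ... then act as distinct members of the family, one per substream.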

1 Answer


You probably DON'T need different hash functions. A common solution to this problem is to use only part of the hash to compute the HyperLogLog rho statistic, and the other part to select the substream. If you use a good hash function (e.g. murmur3), it effectively behaves as multiple independent ones.

See the "stochastic averaging" section here for an explanation of this: https://research.neustar.biz/2012/10/25/sketch-of-the-day-hyperloglog-cornerstone-of-a-big-data-infrastructure/
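The bit-splitting described above looks roughly like this in Python, using SHA-1 as a stand-in for murmur3 (which is not in the standard library); the name `split_hash` and the choice of 14 index bits are illustrative:

```python
import hashlib

P_BITS = 14  # 2**14 = 16384 registers/substreams

def split_hash(item: str):
    # One 64-bit hash value, split into two independent-looking parts.
    h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
    index = h & ((1 << P_BITS) - 1)   # low bits select the substream
    rest = h >> P_BITS                # remaining 50 bits feed the rho statistic
    bits = 64 - P_BITS
    # rho = number of leading zeros in `rest` plus one (HyperLogLog convention)
    rho = bits - rest.bit_length() + 1 if rest else bits + 1
    return index, rho
```

Each item thus consumes a single hash call, yet contributes both a substream index and a rho value.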

OronNavon
  • However, Python does not have a `murmur3` implementation built in; perhaps just use a cryptographic hash function such as `md5`, which would give 128 bits in one go. – Antti Haapala -- Слава Україні Apr 20 '16 at 11:37
  • Good point, although if you're not constrained, I would go ahead and use an external murmur3 implementation. In any case, you need to make sure that your hash function meets your speed requirements (note that cryptographic hash functions are slow), as well as the hash-length requirement (at least 64 bits; 128 is overkill, but you don't have to use all the bits). – OronNavon Apr 20 '16 at 14:02