3

I've done some searching and found a few useful posts about supporting perfect (i.e. no collision) hashing in Java.

Why doesn't Java's hashCode support universal hashing?

Is it possible in java make something like Comparator but for implementing custom equals() and hashCode()

But I am looking for a practical solution, hopefully in the form of a tested library. I have a situation which is suitable for perfect hashing: essentially, we can assume that the set of keys is fixed, and the program runs for a long time and does a lot of lookups. (This is not exactly true, but keys are added rarely enough that it is a close enough approximation, and if I have to re-hash periodically or something to deal with that, that's OK).

Basically, I would like to be able to increase the load factor and also reduce collisions. In other words, the objectives are to reduce memory use, and increase throughput (i.e. number of lookups per second).

There are some issues. Obviously, if hashCode() doesn't return distinct values for distinct keys, then perfect hashing is impossible. And there are other considerations besides the hashing algorithm itself, such as how expensive hashCode() is (or whether I should cache the hash codes on the key objects, etc.), or whatever other function I use to initially map my objects to integers or longs.
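To make the caching idea concrete, here is roughly what I have in mind (just an illustrative sketch; `CachedKey` is a name I made up, not an existing class):

```java
// Illustrative sketch: a wrapper that computes the underlying key's hashCode
// once and caches it, so repeated lookups don't keep paying for an expensive
// hashCode() implementation.
public final class CachedKey<K> {
    private final K key;
    private final int hash;   // cached at construction time

    public CachedKey(K key) {
        this.key = key;
        this.hash = key.hashCode();
    }

    @Override
    public int hashCode() {
        return hash;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CachedKey && key.equals(((CachedKey<?>) o).key);
    }
}
```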

What I am envisioning is being able to re-hash in a background thread, trying different hash functions to find a perfect one or at least a good one. I'm open to another solution though. And I would like to use tested code rather than write it myself, though I am open to that too.
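To make the background-rehashing idea concrete, here is roughly the shape of what I'm picturing (purely a sketch with invented names like `SeedSearcher`, not tested code):

```java
import java.util.BitSet;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.LongConsumer;

// Sketch only: search, off the hot path, for a seed that maps the (mostly
// fixed) key set into the table with few or no collisions, then hand the
// best seed to whoever rebuilds the table.
public final class SeedSearcher {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Mix the key's hashCode with a seed; the exact mixing constants are
    // not important for the sketch.
    static int seededHash(Object key, long seed) {
        long h = key.hashCode() * 0x9E3779B97F4A7C15L + seed;
        h ^= h >>> 32;
        return (int) h;
    }

    // Count how many table slots would receive more than one key for this seed.
    static int collisions(Set<?> keys, long seed, int tableSize) {
        BitSet used = new BitSet(tableSize);
        int collisions = 0;
        for (Object k : keys) {
            int slot = Math.floorMod(seededHash(k, seed), tableSize);
            if (used.get(slot)) collisions++;
            used.set(slot);
        }
        return collisions;
    }

    // Try a bounded number of random seeds in the background and report the best.
    public void search(Set<?> keys, int tableSize, LongConsumer onBestSeed) {
        worker.submit(() -> {
            Random rnd = new Random();
            long bestSeed = 0;
            int best = Integer.MAX_VALUE;
            for (int i = 0; i < 10_000 && best > 0; i++) {
                long seed = rnd.nextLong();
                int c = collisions(keys, seed, tableSize);
                if (c < best) { best = c; bestSeed = seed; }
            }
            onBestSeed.accept(bestSeed);
        });
    }
}
```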

mrip
  • Wouldn't the `hashCode()` implementation depend on knowing more about your `key` type? What's 'perfect' for one use case might not be perfect in general. – Always Learning Apr 20 '15 at 15:55
  • Yes, but that's kind of a separate question. Even if `hashCode()` provides unique values for each key, there can still be collisions because the size of the hashtable will (usually) be smaller than the number of unique integers. – mrip Apr 20 '15 at 15:59
  • Also: http://remis-thoughts.blogspot.co.uk/2012/03/perfect-hashes-in-java-given-set-of-m.html – Nicholas White Dec 23 '17 at 09:41

3 Answers

2

You don't need perfect hashing if your data is sufficiently random. Mitzenmacher has a neat paper explaining both why perfect hashing is hard in practice, and why it's (usually) unnecessary in practice. I'll give you a link and paste in the header so you can find it if the link vanishes.

http://people.seas.harvard.edu/~salil/research/streamhash-Jun10.pdf

Why Simple Hash Functions Work: Exploiting the Entropy in a Data Stream

Michael Mitzenmacher and Salil Vadhan, School of Engineering & Applied Sciences

June 23, 2010

Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing.

In practice, however, it is commonly observed that simple hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby each new data item has some “entropy” given the previous ones. As long as the (Renyi) entropy per data item is sufficiently large, it turns out that the performance when choosing a hash function from a 2-universal family is essentially the same as for a truly random hash function. We describe results for several sample applications, including linear probing, balanced allocations, and Bloom filters.
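For a concrete sense of what "choosing a hash function from a 2-universal family" looks like, here is a small illustrative sketch (mine, not from the paper) of the classic `((a*x + b) mod p) mod m` construction; re-seeding just means constructing a new instance with a fresh `Random`:

```java
import java.util.Random;

// Illustrative 2-universal family h_{a,b}(x) = ((a*x + b) mod p) mod m,
// with p prime. Drawing a and b at random picks a random member of the family.
final class UniversalHash {
    private static final int P = 2147483647;    // the prime 2^31 - 1
    private final long a, b;
    private final int m;                        // table size

    UniversalHash(int tableSize, Random rnd) {
        this.m = tableSize;
        this.a = 1 + rnd.nextInt(P - 1);        // a in [1, p-1]
        this.b = rnd.nextInt(P);                // b in [0, p-1]
    }

    int hash(Object key) {
        long x = Math.floorMod(key.hashCode(), P);  // fold the hashCode into [0, p-1]
        return (int) (((a * x + b) % P) % m);
    }
}
```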

Dave
0

I don't see why you want to re-hash in a background thread. What guarantees that the new hashtable will have fewer collisions? Maybe you could search for collisions and re-hash those keys with another hash function, but what if some of the new hash codes are already in the table? Re-hash until the number of collisions is zero? Nothing guarantees you won't have collisions. See the birthday problem for why: http://en.wikipedia.org/wiki/Birthday_problem.
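To put a number on that, here is a quick back-of-the-envelope estimate using the standard birthday-problem approximation (nothing specific to any particular hash function):

```java
public class BirthdayEstimate {
    // Approximate probability of at least one collision when n keys are
    // hashed uniformly into 2^32 values: 1 - exp(-n*(n-1) / 2^33).
    static double collisionProbability(long n) {
        return 1.0 - Math.exp(-(double) n * (n - 1) / Math.pow(2, 33));
    }

    public static void main(String[] args) {
        // Already roughly 50% with only about 77,000 keys, far fewer than 2^32.
        System.out.println(collisionProbability(77_000));
    }
}
```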

I think you need a hash function with good collision resistance. Here is what my research turned up; I hope it helps!

The best collision-resistant hash function I found is CRC32. With this function, the probability that a given object collides with one of the other N - 1 objects is about (N - 1) / 2^32; the second post here explains why, and there is further research that supports it. Java has a built-in class for it: CRC32.
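For reference, a minimal example of using the built-in java.util.zip.CRC32 class:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Example {
    // Compute the CRC32 checksum of a string's UTF-8 bytes.
    static long crc32(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();   // the 32-bit checksum, as an unsigned long
    }

    public static void main(String[] args) {
        System.out.println(crc32("hello"));   // prints the checksum as a decimal number
    }
}
```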

Zsolt Mester
-2

Use a cryptographic library like BouncyCastle to provide better hash functions. See Hash String via SHA-256 in Java.
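For example, the JDK's own MessageDigest already provides SHA-256 (BouncyCastle can be registered as an additional provider, but the call pattern is the same); here is a sketch of folding the digest down to an int-sized hash:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256Hash {
    // Hash a string with SHA-256 and fold the first four bytes of the
    // digest into an int, for hashCode-style use.
    static int sha256Hash(String s) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(s.getBytes(StandardCharsets.UTF_8));
        return ((digest[0] & 0xFF) << 24) | ((digest[1] & 0xFF) << 16)
             | ((digest[2] & 0xFF) << 8)  |  (digest[3] & 0xFF);
    }
}
```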

Another option seemed to be something like http://www.anarres.org/projects/jperf/ , but I haven't tried it myself.

JP Moresmau
  • Those are not perfect hashes. – SLaks Apr 20 '15 at 15:51
  • Thanks for the links. JPerf looks promising, although some of the links seem to be dead at least from here. Cryptographic hashing is not really what I'm looking for. – mrip Apr 20 '15 at 16:04