
Some hash table schemes, such as cuckoo hashing and dynamic perfect hashing, rely on universal hash functions: when a collection of keys exhibits too many collisions, the table resolves them by drawing a new hash function from the universal family.

A while ago I was trying to implement a hash table in Java backed by cuckoo hashing and ran into trouble: while all Java objects have a hashCode method, the value that hashCode returns is fixed for each object (unless, of course, the object's state changes). This means that, without the user providing an external family of universal hash functions, it's impossible to build a hash table that relies on universal hashing.

Initially I thought that I could get around this by applying a universal hash function to the objects' hashCodes directly, but this doesn't work: if two objects have the same hashCode, then any deterministic function applied to those hash codes, even a randomly chosen hash function, will produce the same value for both and thus still cause a collision.
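To make this concrete, here is a minimal sketch (the multiply-shift step is just an arbitrary stand-in for a randomly chosen hash function):

```java
import java.util.Random;

// Minimal sketch: no deterministic post-processing of hashCode() can
// separate two objects whose hashCodes are already equal.
public class FixedHashCodeDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" are a classic pair of distinct strings with
        // identical values under String.hashCode() (both are 2112).
        String x = "Aa";
        String y = "BB";
        System.out.println(x.hashCode() == y.hashCode());  // true

        // Draw a "random" multiply-shift hash over the 32-bit hashCode...
        Random rng = new Random();
        int a = rng.nextInt() | 1;  // random odd multiplier
        int hx = (a * x.hashCode()) >>> 16;
        int hy = (a * y.hashCode()) >>> 16;

        // ...and the collision survives for every possible multiplier.
        System.out.println(hx == hy);  // always true
    }
}
```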

This seems like it would be detrimental to Java's design. It means that HashMap and the other hash containers can never be backed by tables based on universal hashing, even if the language designers were to decide that such tables would be appropriate. It also makes it harder for third-party library designers to build hash tables of this sort.

My question is: is there a reason that Java opted to design hashCode without considering the possibility of hashing objects with multiple hash functions? I understand that many good hashing schemes like chained hashing or quadratic probing don't require it, but it seems as though the decision makes it hard to use certain classes of algorithms on Java objects.

GEOCHET
templatetypedef
    How would this be implemented? Requiring everyone to implement a dozen hash functions? Hash tables are much easier to get working by simply detecting collisions and working around them. –  Mar 05 '11 at 11:02
  • @delnan- I was thinking more along the lines of `hashCode` taking in some bitstring that it could then interpret when generating the hash function. For example, if you were hashing integers, you could treat the bitstring as a series of 8-bit values with which you could scale up the individual octets in the number. – templatetypedef Mar 05 '11 at 11:05
  • Cuckoo hashing doesn't depend on universal hash functions; there are multiple simple variations that get around this. The most trivial is when the hash table can simply be grown (which is typically the case in Java applications). – MSalters Jun 01 '12 at 12:20
  • @MSalters- Can you elaborate on this? I was under the impression that universal hashing was necessary, since if two Java objects have the same hash code they will collide no matter how big the table is. I can't find any sources suggesting otherwise. – templatetypedef Jun 01 '12 at 14:54
  • I've always thought about universal hashing as something you need to make proofs, not to make systems. I'd love to see you come up with an example where you need multiple different hash functions, but a salted version of cityhash is inadequate, and a universal hash function works. – Rob Neuhaus Jun 01 '12 at 16:11
  • @rrenaud- A universal hash family is just a set of hash functions such that any fixed pair of distinct elements collides with low probability (at most 1/m for a table of m buckets) when a function is drawn at random from the family. It's quite possible that a salted hash would form a universal family of hash functions. – templatetypedef Jun 01 '12 at 16:41
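For concreteness, one standard construction of such a family is the multiply-add-shift scheme (after Dietzfelbinger et al.) for 32-bit keys. The sketch below (class name is mine) draws a random member per instance, so rehashing just means constructing a fresh one:

```java
import java.util.Random;

// Sketch of a universal family for 32-bit keys (multiply-add-shift):
//   h(x) = ((a*x + b) mod 2^64) >>> (64 - d)
// where a and b are random 64-bit values and the table has 2^d buckets.
public class UniversalHash {
    private final long a, b;
    private final int d;  // bucket-index width; assume 1 <= d <= 31

    public UniversalHash(int d, Random rng) {
        this.d = d;
        this.a = rng.nextLong();
        this.b = rng.nextLong();
    }

    public int hash(int key) {
        long x = key & 0xFFFFFFFFL;               // key as unsigned 32-bit
        return (int) ((a * x + b) >>> (64 - d));  // top d bits of a*x + b
    }
}
```

Two distinct int keys collide with probability about 2^-d over the random choice of a and b, which is the guarantee cuckoo hashing wants. Note, though, that this hashes int keys; it still cannot separate two objects whose hashCode values are already equal, which is exactly the problem in the question.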

2 Answers


Simplicity. Java allows class designers to provide their own hashCode, which, as you mention, is good enough for "ordinary" hash tables, and even that can be hard enough to understand.

Besides, when the Java Collections API was designed, having generic hash tables in the standard library was bold enough a move already. C has never had them. C++ had them in the STL as hash_set and hash_map, but those didn't make it into the standard. Only now, in C++0x, are hash tables being considered for standardization again.

Fred Foo

I think the normal hashCode method was designed without the "malicious inputs" case in mind. Also, as larsmans wrote, its contract is much easier to understand and implement than a universal hash function would be.

Here's an idea about what to do:

  • Use a map implementation relying on external hash functions (like the HashableEquivalenceRelation I presented some hours ago here),
  • then use a universal family of such implementations (or an implementation which allows changing a parameter to switch to another member of the family); a rough sketch follows below.
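A rough sketch of that idea, with made-up names (the HashableEquivalenceRelation from the linked answer would play the role of the interface below):

```java
// Hypothetical stand-in for an externally supplied hash strategy; the
// HashableEquivalenceRelation mentioned above would play this role.
interface HashStrategy<T> {
    boolean equivalent(T a, T b);
    int hash(T value);
}

// One "family member" per seed: when the table needs to rehash (e.g. after
// too many cuckoo displacements), it constructs a strategy with a new seed.
class SeededStringStrategy implements HashStrategy<String> {
    private final long multiplier;

    SeededStringStrategy(long seed) {
        this.multiplier = seed | 1L;  // force an odd multiplier
    }

    @Override public boolean equivalent(String a, String b) {
        return a.equals(b);
    }

    @Override public int hash(String value) {
        // A polynomial hash whose multiplier varies with the seed. This is
        // only an illustration of the reseeding idea, not a proven
        // universal family.
        long h = 0;
        for (int i = 0; i < value.length(); i++) {
            h = h * multiplier + value.charAt(i);
        }
        return (int) (h ^ (h >>> 32));  // fold to 32 bits
    }
}
```

The point is that the map hashes through the strategy rather than through Object.hashCode, so switching to another member of the family is just a matter of constructing the table with a new seed.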
Paŭlo Ebermann