
My question is not about the double hashing technique http://en.wikipedia.org/wiki/Double_hashing , which is a way to resolve collisions. It is about handling existing collisions in a hash table of strings. Say we have a collision: several strings in the same bucket, so now we must go through the bucket comparing the strings. It seems it would make sense to compute another hash function for fast string comparison (compare hash values for quick rejection). The hash value could be lazily computed and saved with the string. Have you used such a technique? Could you provide a reference? If not, do you think it's not worth doing because the performance gain is questionable? Some notes:

  1. I put the "Java" tag since I did my measurements in Java: String.hashCode() in most cases outperforms String.equals() (and, BTW, greatly outperforms manual hash code calculation: hashCode = 31 * hashCode + strInTable.charAt(i));
  2. Of course, the same could be asked about any string comparison, not necessarily strings in a hash table. But I am considering a specific situation with a huge number of strings kept in a hash table.
  3. This probably makes sense if the strings in the bucket are somewhat similar (as in the Rabin-Karp algorithm). I'm looking for your opinion on the general situation.
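The idea in the question can be sketched as follows. This is a hypothetical illustration (the class and method names are my own, not from any library): each stored string carries a lazily computed hash, and comparison checks the hashes first for quick rejection before falling back to equals().

```java
// Hypothetical sketch: wrap each stored string with a lazily computed
// hash, used for quick rejection before a full equals().
final class HashedString {
    private final String value;
    private int hash; // 0 means "not yet computed" (same convention String uses internally)

    HashedString(String value) {
        this.value = value;
    }

    int lazyHash() {
        int h = hash;
        if (h == 0) {
            h = value.hashCode(); // computed once on first use, then cached
            hash = h;
        }
        return h;
    }

    boolean matches(HashedString other) {
        // Cheap 32-bit comparison first; run equals() only when the hashes agree.
        return lazyHash() == other.lazyHash() && value.equals(other.value);
    }
}
```

Note that this sketch inherits String.hashCode()'s quirk: a string whose hash is genuinely 0 would be rehashed on every call.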
TT_

2 Answers


Many hash-based collections store the hash value of each item in the collection, on the premise that since every item's hash will have been computed when it was added to the collection, and code which is looking for an item in a hashed collection will have to know its hash, comparing hash values will be a quick and easy way of reducing the cost of false hits. For example, if one has a 16-bucket hash table that contains four strings of 1,000 characters each, and will be searching for a lot of 1,000-character strings which match one of the table entries in all but the last few characters, more than 6% of searches will hit a bucket that contains a near-match string, but a much smaller fraction will hit a bucket that contains a string whose 32-bit hashCode matches that of the string being sought. Since comparisons of nearly identical strings are expensive, comparing full 32-bit hash codes first is helpful.
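A minimal sketch of the bucket-scan pattern described above, storing each entry's hash alongside the key (this mirrors the spirit of java.util.HashMap's internal node comparison; the Bucket/Entry names here are illustrative, not a real API):

```java
// Sketch of a singly-linked hash bucket whose entries cache their hash,
// so the scan rejects near-matches without comparing characters.
final class Bucket {
    static final class Entry {
        final int hash;   // cached when the entry was added
        final String key;
        final Entry next;
        Entry(int hash, String key, Entry next) {
            this.hash = hash;
            this.key = key;
            this.next = next;
        }
    }

    private Entry head;

    void add(String key) {
        head = new Entry(key.hashCode(), key, head);
    }

    boolean contains(String key) {
        int h = key.hashCode();
        for (Entry e = head; e != null; e = e.next) {
            // The cheap 32-bit comparison filters out almost all false hits;
            // the expensive equals() runs only on a hash match.
            if (e.hash == h && e.key.equals(key)) return true;
        }
        return false;
    }
}
```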

If one has large immutable collections which may need to be stored in hash tables and matched against other such collections, there may be some value in having such collections compute and cache longer hash functions, and having their equals methods compare the results of those longer hash functions before proceeding further. In such cases, computing a longer hash function will often be almost as fast as computing a shorter one. Further, not only will comparisons on the longer hash code greatly reduce the risks that false positives will cause needless "deep" comparisons, but computing longer hash functions and combining them into the reported hashCode() may greatly reduce the dangers of strongly-correlated hash collisions.
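One way to sketch that idea: an immutable value type that computes a longer (here 64-bit) hash once at construction, compares it in equals() before the deep comparison, and folds both halves into hashCode(). The particular 64-bit combination below (Arrays.hashCode plus CRC32) is an illustrative assumption, not a specific recommendation from the answer.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Hypothetical immutable value caching a longer hash for cheap rejection.
final class BigValue {
    private final int[] data;
    private final long longHash; // computed once at construction

    BigValue(int[] data) {
        this.data = data.clone();
        CRC32 crc = new CRC32();
        for (int x : this.data) {
            crc.update(x & 0xFF);          // feed low byte
            crc.update((x >>> 8) & 0xFF);  // and next byte, byte-by-byte
        }
        // Combine two independent 32-bit hashes into one 64-bit value.
        this.longHash = ((long) Arrays.hashCode(this.data) << 32) | crc.getValue();
    }

    @Override
    public int hashCode() {
        // Mixing both halves of the longer hash into the reported hashCode
        // reduces the danger of strongly-correlated collisions.
        return (int) (longHash ^ (longHash >>> 32));
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof BigValue)) return false;
        BigValue other = (BigValue) o;
        // Cheap 64-bit rejection before the expensive deep comparison.
        return longHash == other.longHash && Arrays.equals(data, other.data);
    }
}
```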

supercat

Comparing a hash only makes sense if the number of comparisons (lookups) is large compared to the number of entries. You would need a large hash (32 bits are not enough; you'd want at least 128 bits), and that would be expensive to calculate. You would want to amortize the cost of hashing each string over a large number of probes into the buckets.

As to whether it's worth it or not, it's highly context dependent. The only way to find out is to actually do it with your data and compare the performance of both methods.
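A rough ad-hoc loop along these lines could serve as a starting point for such a measurement (timings from loops like this are only indicative; a harness such as JMH gives more reliable numbers). The strings here differ only in the last character, the worst case for equals():

```java
// Hypothetical micro-benchmark sketch: direct equals() vs. a
// cached-hash-first comparison on nearly identical 1,000-char strings.
final class CompareBench {
    public static void main(String[] args) {
        char[] base = new char[1000];
        java.util.Arrays.fill(base, 'x');
        String a = new String(base);
        base[999] = 'y'; // differ only in the final character
        String b = new String(base);

        int iterations = 100_000;

        long t0 = System.nanoTime();
        int hits = 0;
        for (int i = 0; i < iterations; i++) {
            if (a.equals(b)) hits++;       // scans ~1,000 chars every time
        }
        long equalsNs = System.nanoTime() - t0;

        int ha = a.hashCode(), hb = b.hashCode(); // String caches these after first call
        t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            if (ha == hb && a.equals(b)) hits++;  // rejected by the int comparison
        }
        long hashFirstNs = System.nanoTime() - t0;

        System.out.println("equals-only: " + equalsNs / 1_000_000 + " ms, hits=" + hits);
        System.out.println("hash-first:  " + hashFirstNs / 1_000_000 + " ms");
    }
}
```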

Jim Garrison