2

I'm looking for the most effective way of creating hashcodes for a very specific case of strings.

I have strings that can be converted to integer, they vary from 1 to 10,000, and they are very concentrated on the 1-600 range.

My question is what is the most effective way, in terms of performance for retrieving the items from a collection to implement the hashcode for it.

What I'm thinking is:

  • I can have the strings converted to integer and use a direct acess table (an array of 10.000 rows) - this will be very fast for retrieving but not very smart in terms of memory allocation;

  • I can use the strings as strings and get a hashcode for it (i wont have to convert it to integer, but i dont know how effective will be the hashcode for the strings in terms of collisions)

Any other ideas are greatly appreciated.

thanks a lot

Thanks everyone for your promptly replies...

There is another information Tha i've forget to add on this. I tink it Will Make this clear if I let you know my final goal with this-I migh not even need a hash table!!!

I just want to validate a stream against a dictiory that is immutable. I want to check if a given tag might or might not be present on my message.

I will receive a string with several pairs tag=value. I want to verify if the tag must or must not be treated by my app.

  • 2
    Use the `hashCode` method of String? It's a pretty good algorithm. – Hot Licks May 22 '12 at 20:25
  • @HotLicks But what about the hashcode spread? OP has a legitimate issue. Performance of the native hashCode might be another issue if the case is so constrained. There could be more performant alternatives. – Marko Topolnik May 22 '12 at 20:27
  • @Marko, why do you think the hash codes will not already be evenly spread? – Kirk Woll May 22 '12 at 20:29
  • Like I said, `hashCode` for String is a pretty good algorithm. The other obvious choice is to use the integer value itself, but then you can't use a simple HashMap on String -- you have to either create your own wrapper objects or swizzle HashMap. – Hot Licks May 22 '12 at 20:29
  • @HotLicks It' a pretty good **general** algorithm. And swizzling in Java? There's no performant way of doing that on a final class. – Marko Topolnik May 22 '12 at 20:30
  • @KirkWoll Now it's out of context -- Hot Licks deleted his comment, but added another later with the same point -- use the parsed integer itself for hashCode. That may cause spreading issues since the range is narrow. – Marko Topolnik May 22 '12 at 20:34
  • @Marko, agreed. I would recommend against using the raw value of the integer as the hash code. Thought you were referring to the string version. – Kirk Woll May 22 '12 at 20:35
  • If the numbers are indeed concentrated in the 1-600 range, and you make a hashtable with several hundred buckets, the hashing would be pretty good using the number itself. At most you'd have 2-3 entries per bucket. But, of course, you have to create your own hashtable. – Hot Licks May 22 '12 at 21:06

4 Answers4

1

You might want to consider a trie (http://en.wikipedia.org/wiki/Trie) or radix tree (http://en.wikipedia.org/wiki/Radix_tree). No need to parse the string into an integer, or compute a hash code. You're walking a tree as you walk the string.

Edit:

Both computing a hash code on a string and parsing an integer out of a string involve walking the entire the string, and THEN using that value as a look-up into a specific data structure. Other techniques might involve simultaneously inspecting the string WHILE traversing a data structure. This MIGHT be of value to the poster who asked for "other ideas".

pamphlet
  • 2,054
  • 1
  • 17
  • 27
1

Many collections (e.g. HashMap) already apply a supplemental "rehash" method to help with poor hashcode algorithms. e.g. browse the cource code for HashMap.hash(). And Strings are very common keys, so you can be sure that String.hashCode() is highly optimized. SO, unless you notice a lot of collisions between your hashCodes, I'd go with the standard code.

I tried putting the Strings for 0..600 into a HashSet to see what happened, but it's then pretty tedious to see how many entries had collisions. Look for yourself! If you really really care, copy the source code from HashMap into your own class, edit it so you can get access to the entries (in the Java 6 source code I'm looking at, that would be transient Entry[] table, YMMV), and add methods to count collisions.

user949300
  • 15,364
  • 7
  • 35
  • 66
  • Create a boolean array of the size of the presumed hash table, hash your strings, and set the corresponding bit. If the bit is already set you have a collision. You can go the next step with an array of ints and find the longest chain. – Hot Licks May 22 '12 at 21:13
0

If there are only a limited valid range of values, why not represent the collection as a int[10000] as you suggested? The value at array[x] is the number of times that x occurs.

If your strings are represented as decimal integers, then parsing them to strings is a 5-iteration loop (up to 5 digits) and a couple of additions and subtractions. That is, it is incredibly fast. Inserting the elements is effectively O(1), retrieval is O(1). Memory required is around 40kb (4 bytes per int).

One problem is that insertion order is not preserved. Maybe you don't care.

Maybe you could think about caching the hashcode and only updating it if your collection has changed since the last time hashcode() was called. See Caching hashes in Java collections?

Community
  • 1
  • 1
sjr
  • 9,769
  • 1
  • 25
  • 36
0

«Insert disclaimer about only doing this when it's a hot spot in your application and you can prove it»

Well the integer value itself will be a perfect hash function, you will not get any collisions. However there are two problems with this approach:

  1. HashMap doesn't allow you to specify a custom hash function. So either you'll have to implement you own HashMap or you use a wrapper object.
  2. HashMap uses a bitwise and instead of a modulo operation to find the bucket. This obviously throws bits away since it's just a mask. java.util.HashMap.hash(int) tries to compensate for this but I have seen claims that this is not very successful. Again we're back to implementing your own HashMap.

Now that this point since you're using the integer value as a hash function why not use the integer value as a key in the HashMap instead of the string? If you really want optimize this you can write a hash map that uses int instead of Integer keys or use TIntObjectHashMap from trove.

If you're really interested in finding good hash functions I can recommend Hashing in Smalltalk, just ignore the half dozen pages where the author rants about Java (disclaimer: I know the author).

Philippe Marschall
  • 4,452
  • 1
  • 34
  • 52