
I'm writing a program that will be able to solve word puzzles. Essentially I'm taking in a dictionary through an Infile.txt and creating a hash table from it. I'm going to use separate chaining, with the Java LinkedList class as the second level of the hash table (a simple array whose slots point to linked lists). Feel free to suggest a better solution, as I am a novice with data structures. After taking in the dictionary I will be searching the hash table for words against a list of jumbled strings from an infile. I'm not worried about the searching at this point.

The dictionary contains 109530 words; that is the constant size of the input data. What would you say is the best size for the hash table? I've read conflicting things about this, so I thought I'd ask here. Please explain your reasoning a bit.

Lastly, I'm going to use the following function as the hash function:

Hash(string) = ( SumOf(AsciiValOfChar() * CharPosInString()) ) mod TableSize;

Example: the string "abc" hashes to 97 (ASCII value of 'a') * 1 + 98 * 2 + 99 * 3 = 590, then mod tablesize. So if the table size is 10, "abc" maps to 590 mod 10 = 0.
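
In Java that might look roughly like the following (just a sketch of the description above; the method name and signature are placeholders):

// Position-weighted sum of character values, with positions counted from 1.
// For "abc": 97*1 + 98*2 + 99*3 = 590.
static int hash(String word, int tableSize) {
    int sum = 0;
    for (int i = 0; i < word.length(); i++) {
        sum += word.charAt(i) * (i + 1); // character value times its 1-based position
    }
    return sum % tableSize;              // 590 % 10 == 0 when tableSize is 10
}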

Any thoughts about this hash function?

Thanks a lot guys, your time is highly appreciated.

EDIT: I'm not using the Java Hashtable / HashMap class; rather, I need to write my own. This is an exercise.

Dmitry
  • Have you already tried your approach? I mean, have you created a HashTable and filled it with all the words? That would be a good thing to do if you want to test how your hash function behaves and whether it needs adjustments. – alejo Apr 04 '14 at 21:14
  • There are many different considerations when it comes to designing a hashtable scheme. But as a general rule, and assuming you do not use a secondary hash for "synonyms", you want to keep the number of "synonyms" per bucket to a minimum (1 is ideal) while at the same time not wasting space on empty buckets. To a certain degree this depends on how well distributed your hash algorithm is, and it also depends on what your overflow strategy is, but a good guess at an "optimal" table size is half the total number of entries. – Hot Licks Apr 04 '14 at 21:44

1 Answer


tl;dr: 1) use at least 109530 * 1.33 ≈ 146,000 buckets as the final capacity (roughly the entry count divided by a 0.75 fill factor), and 2) the hash function "will work", even if it is not ideal.


The best bucket count depends on the particular hash table implementation used, the data, and the quality of the hash function.

Since there are many factors in play, my recommendation is to simply write the hash table so that it can regrow/resize as appropriate. Provide configuration options to control the initial capacity, fill factor (0.75 is a good start), and grow factor (doubling is a good start). The hash table can then be trivially tuned after running some tests.
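
A minimal sketch of that idea, using java.util.LinkedList chains as described in the question (class and member names here are made up for illustration, not a prescribed design):

import java.util.LinkedList;

// Separate-chaining hash table with a configurable initial capacity,
// fill factor, and grow factor.
public class ChainedHashTable {
    private LinkedList<String>[] buckets;
    private int size = 0;
    private final double fillFactor; // resize once size exceeds fillFactor * buckets.length
    private final int growFactor;    // multiply the bucket count by this when resizing

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int initialCapacity, double fillFactor, int growFactor) {
        this.buckets = (LinkedList<String>[]) new LinkedList[initialCapacity];
        this.fillFactor = fillFactor;
        this.growFactor = growFactor;
    }

    public void add(String word) {
        if (size + 1 > fillFactor * buckets.length) {
            resize(buckets.length * growFactor);
        }
        int index = bucketIndex(word, buckets.length);
        if (buckets[index] == null) {
            buckets[index] = new LinkedList<>();
        }
        if (!buckets[index].contains(word)) {
            buckets[index].add(word);
            size++;
        }
    }

    public boolean contains(String word) {
        LinkedList<String> chain = buckets[bucketIndex(word, buckets.length)];
        return chain != null && chain.contains(word);
    }

    @SuppressWarnings("unchecked")
    private void resize(int newCapacity) {
        LinkedList<String>[] old = buckets;
        buckets = (LinkedList<String>[]) new LinkedList[newCapacity];
        for (LinkedList<String> chain : old) {
            if (chain == null) continue;
            for (String word : chain) {
                // re-hash every existing entry into the larger table
                int index = bucketIndex(word, newCapacity);
                if (buckets[index] == null) {
                    buckets[index] = new LinkedList<>();
                }
                buckets[index].add(word);
            }
        }
    }

    private static int bucketIndex(String word, int bucketCount) {
        // Any hash function can be plugged in here; floorMod keeps the
        // index non-negative even when the hash value is negative.
        return Math.floorMod(word.hashCode(), bucketCount);
    }
}

With those knobs exposed you can load the dictionary once, print the longest chain length and the number of empty buckets, and then adjust the initial capacity and fill factor until the distribution looks acceptable.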

Using powers of two for the bucket count effectively causes the remainder operation to be "reduced to masking [and can] increase problems with a poor hash function", which is why it is sometimes recommended against. However, the key phrase is "poor hash function", and some implementations require a power of two and use an internal supplemental hash to mitigate this situation. Since this is a simple implementation, just choose an odd number of sufficient magnitude, such as a power of two minus one.

As far as the hash function itself goes, there are several goals, such as providing a uniform distribution and avoiding clustering. The suggested hash does not do this very well, especially for small or similar strings: position-weighted sums of short words can only produce small values (every two-letter word sums to less than 400, and "bc" and "db" both sum to 296, for instance), so in a large table most entries crowd into the low-numbered buckets. Such a hash will still work - even if it causes more collisions/clustering than a better counterpart.

Instead, consider Java's String.hashCode, which uses a compounding multiplier applied against the previous hash value. (The .NET version is more complicated, but uses a similar idea of a compounding/running hash value.)

int h = 0;
for (int i = 0; i < value.length; i++) {
    h = 31 * h + value[i];   // running hash: previous hash times 31, plus the next char
}

The multiplier of 31 isn't the only "good" value, but it was chosen carefully: it is an odd prime (an even multiplier would lose information whenever the multiplication overflows), and 31 * h can be computed cheaply as (h << 5) - h.

(The modulus against the bucket count is not part of the hashing function.)
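
In other words, the hash is computed from the string alone and the bucket index is derived from it afterwards, something like (a sketch, where buckets.length stands for whatever the current table size is):

int h = word.hashCode();                       // depends only on the string
int index = (h & 0x7fffffff) % buckets.length; // depends on the current table size

Keeping the two separate means the hash function never needs to know how large the table is, which matters once the table can resize itself.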

user2864740