What would be a good hash function for this list of English words?

Question

Currently, I am using a tablesize of 80, since I have about 73 words in the file. My current method of hashing is pretty basic and generic. I add up the ASCII value of the letters after I make them all lowercase, then I mod (%) by the tablesize (80 currently). I am getting a lot of collisions, and a lot of unused bucket/indexes. Since I know exactly which words I need to hash and how many, are there better methods to use, for the least possible collisions? My goal is to get 6 or less.

Also, side question. Once the words are in the hashtable, if I want to look up a certain word, but type that word incorrectly, or scrambled up, how would I find it in the hashtable?

For example, if I have "apple" in the hashtable, and for my search, I use "leppa", which is apple spells backward, whats a good way to unscramble "leppa" in such a way that, apple would come out?

Please ask me if you're unsure about what I just ask, sorry if I'm not clear!

Unfortunately I am not allowed to use that tool to come up with a "perfect hash function". I already did a bunch of research and also found that on Google, but its unfortunate that I can't use it for this project — The Newbie, May 16 '17 at 05:25
What you're describing in the second part is simply a spell checker or spell corrector (they're slightly different, but very similar). Try googling "spell checker algorithms" or something similar to get yourself started. [This stackoverflow question](http://stackoverflow.com/questions/2294915/what-algorithm-gives-suggestions-in-a-spell-checker) is another good place to start. — mgarey, May 16 '17 at 05:26

Pavel P · Answer 1 · 2017-05-16T05:47:44.300

3

Try md5, you won't have collisions in your dictionary.

You may simply use std::hash:

#include <string>
#include <iostream>
#include <functional>

int main()
{
    std::string str = "air conditioner";
    size_t h = std::hash<std::string>()(str);
    std::cout << "hash of \"" << str << "\" is " << h << std::endl;
}

commonly it might be implemented as fnv1 hash. Another good hash function is murmur. Check related question on stackexchange for other common hash functions.

edited May 16 '17 at 05:47

answered May 16 '17 at 05:21

Pavel P

15,789
11
79
128

How so? I have a hash table going from 0-79. If I MD5 a string, it would not give me an int from 0-79, so I can't store anything in the hash table? – The Newbie May 16 '17 at 05:26
@TheNewbie it's kind of pointless exercise as you can simply return index from your dictionary to get "perfect" hash function. – Pavel P May 16 '17 at 05:31
@Parvel, what do you mean? if there are collisions in my hashtable, how could it be perfect? – The Newbie May 16 '17 at 05:36
Using the FNV1 hash you mention, I'm getting access violation. Not sure if this is how you would implement it? `for (int i = 0; i < word.length(); i++) { hash = hash * 0x100000001b3; hash = hash ^ (int)word[i]; } index = hash % tableSize;` – The Newbie May 16 '17 at 05:41
@TheNewbie try `std::hash()(str) % tableSize` – Pavel P May 16 '17 at 05:45
Using the code you just commented, with a tablesize of 80, Im actually getting 4 collisions @ index 68, and 2-3 collisions in other buckets. Maybe I'm doing something wrong? – The Newbie May 16 '17 at 05:56
@TheNewbie I think it's very good result, I think there should be more collisions if you mod by 80 – Pavel P May 16 '17 at 05:58
I'm doing this without a for loop by the way. I'm throwing in what you gave me, into my hash function, which just takes the word, hash it, then mod by 80. I'll try to play around with the table size or something and see if its any better – The Newbie May 16 '17 at 06:01

score 3 · Answer 2 · answered May 16 '17 at 05:28

Murmur hash is considered fast and will probably give good distribution http://en.wikipedia.org/wiki/MurmurHash

In order to look for a "scrambled" text in a hash, you need to use hash-function that is agnostic to the letters order - pretty bad idea since all permutations will be in the same hash bucket

What would be a good hash function for this list of English words?

2 Answers2