2

I need some ideas to develop a good hashing function for my assignment. I have a list of all the countries in the world (around 190) in total. The names of each country is the key for the hashing function. Is there a specific kind of hashing function anyone would recommend to store this data in a hashing function without many collisions? Also, can you perhaps give an example of how to implement it?

Brian Roach
  • 76,169
  • 12
  • 136
  • 161
Eric
  • 21
  • 1
  • A similar question that might be of some help (java instead of c, but same principles apply) http://stackoverflow.com/questions/2624192/good-hash-function-for-strings – cobbal Apr 24 '11 at 23:27
  • 4
    ISO country codes - fits in an integer, guaranteed to be collision-free – Erik Apr 24 '11 at 23:27
  • 1
    @Erik: Very clever; if you made it an answer I'd upvote it. – John Zwinck Apr 24 '11 at 23:28
  • @John: I don't think that really answers the OPs question though, I'll leave it a comment :P – Erik Apr 24 '11 at 23:29
  • You could write a pretty basic hashing function then write a test script (in your favorite scripting language) to see if there are any collisions. If you only have ~200 items in your data set, I doubt you have much to worry about. – Chris Lutz Apr 24 '11 at 23:35
  • 1
    Are the country names in English and ASCII? – Mikel Apr 25 '11 at 00:00

3 Answers3

2

Use GNU gperf. For inputs like yours, it will generate C code for you which implements a perfect hash function (for the given inputs). No collisions, no worries.

John Zwinck
  • 239,568
  • 38
  • 324
  • 436
2

You can use generated perfect hash for that (GNU perf).

Of if the set of strings is dynamic then you can use ternary trie. For N unique strings it will give you unique number [1..N]. For your case it will be faster than with hash tables. Here is my implementation of such thing: http://code.google.com/p/tiscript/source/browse/trunk/tool/tl_ternary_tree.h

c-smile
  • 26,734
  • 7
  • 59
  • 86
  • 1
    You should also point to [tries](http://en.wikipedia.org/wiki/Trie), the ternary tree's easier to implement cousin. – hugomg Apr 25 '11 at 00:26
0

The simplest approach I can think of is for each country's name to compute the sum of the ASCII values in its representation and use this as the hash value:

int hash(const char *s)
{
    int h = 0;

    while (s && *s)
          h += *s++;

    return h;
}

If your hash map has size N, you store country names with map[hash(my_country) % N] = my_country. Conceptually.

Just try this approach and see whether the resulting hash values are sufficiently uniformly distributed. Note that the quality of the distribution may also depend on N.

Philip
  • 5,795
  • 3
  • 33
  • 68