10

Suppose I have 200,000 words, and I am going to use hash*33 + word[i] as the hash function. What should the size of the table be for best performance and minimal memory/paging overhead?

Platform used: C (C99).

Words are English words, ASCII characters.

The hash table is initialized once (buckets with linked-list chaining) and is then used for lookups, like a dictionary search.

On a collision, the word is added as a new node in the bucket's list.
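Roughly, the structure looks like this (a simplified sketch, not my exact code; the table size, hash seed, and names are just illustrative):

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the setup described above: hash*33 + word[i] with
 * separate chaining. TABLE_SIZE and the seed are illustrative. */
#define TABLE_SIZE 262144u

struct node {
    char *word;
    struct node *next;
};

static struct node *table[TABLE_SIZE];

/* djb2-style hash: hash = hash*33 + c. The 5381 seed is the usual
 * convention; any seed works with the same recurrence. */
static unsigned long hash_word(const char *word)
{
    unsigned long hash = 5381;
    for (size_t i = 0; word[i] != '\0'; i++)
        hash = hash * 33 + (unsigned char)word[i];
    return hash;
}

static int insert_word(const char *word)
{
    size_t idx = hash_word(word) % TABLE_SIZE;
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    size_t len = strlen(word) + 1;
    n->word = malloc(len);
    if (n->word == NULL) {
        free(n);
        return -1;
    }
    memcpy(n->word, word, len);
    n->next = table[idx];   /* on collision, prepend to the bucket's list */
    table[idx] = n;
    return 0;
}
```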

amitfreeman
  • 165
  • 1
  • 3
  • 12
  • There are still too many open variables to provide an answer: platform used, frequency of updates and reads, definition of word (i.e. 32-bit value or English word) – Peter G. Mar 30 '14 at 08:46
  • How are you going to resolve collisions? Will you have a list of words with the same hash or will you place the word into another cell? – Marian Mar 30 '14 at 08:49
  • @PeterG., now updated – amitfreeman Mar 30 '14 at 08:54
  • 1
    This is really something to benchmark. It would depend on the exact distribution of the words at any time (which we'd need an exact list of the words / hash values in order of insertion / deletion to determine), the exact specifications of the machine used, the code and possibly even the compiler. – Bernhard Barker Mar 30 '14 at 12:28
  • @Dukeling I wanted a rough idea of how it's calculated. Anyway, I used getrusage to time insertion, searching & deletion, and with trial & error I settled on a size of 51,200 for 200,000 words, about 3-4 nodes per bucket. It's pretty fast! – amitfreeman Mar 30 '14 at 14:25

1 Answer

18

A good rule of thumb is to keep the load factor at 75% or less (some will say 70%) to maintain (very close to) O(1) lookup, assuming you have a good hash function.

Based on that, for 200,000 words you would want a minimum of about 266,700 buckets (200,000 / 0.75) to stay at 75%, or about 285,700 buckets (200,000 / 0.70) for 70%. That's assuming no collisions.
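If you want to sanity-check those figures, the arithmetic is just items divided by the target load factor (a throwaway calculation, not part of the table code):

```c
#include <stdio.h>

/* buckets = items / target load factor */
int main(void)
{
    const double items = 200000.0;
    printf("75%%: %.0f buckets\n", items / 0.75);  /* ~266,667 */
    printf("70%%: %.0f buckets\n", items / 0.70);  /* ~285,714 */
    return 0;
}
```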

That said, your best bet is to run a test with some sample data at various hash table sizes and see how many collisions you get.
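A sketch of what such a test could look like (the raw hash values are assumed to be computed elsewhere with whatever hash function you're evaluating; names are illustrative):

```c
#include <stdlib.h>

/* Given the raw (un-modded) hash of every word, count how many
 * insertions would land in an already-occupied bucket for a
 * candidate table size. */
static size_t count_collisions(const unsigned long *hashes, size_t n,
                               size_t table_size)
{
    unsigned char *used = calloc(table_size, 1);
    size_t collisions = 0;
    if (used == NULL)
        return 0;
    for (size_t i = 0; i < n; i++) {
        size_t idx = hashes[i] % table_size;
        if (used[idx])
            collisions++;
        else
            used[idx] = 1;
    }
    free(used);
    return collisions;
}
```

Run it over your 200,000 hashes for a handful of candidate sizes and compare; chain-length statistics (average and maximum) are a useful second number to collect.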

You might also consider a better hash function than hash*33 + word[i]. The Jenkins hash and its variants require more computation, but they give a better distribution and thus will generally make for fewer collisions and a smaller required table size.
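For reference, the simplest of the Jenkins variants, the one-at-a-time hash, looks roughly like this; take the result modulo your table size to get a bucket index:

```c
#include <stdint.h>
#include <string.h>

/* Jenkins one-at-a-time hash (illustrative version). */
static uint32_t one_at_a_time(const char *key)
{
    uint32_t hash = 0;
    for (size_t i = 0; key[i] != '\0'; i++) {
        hash += (unsigned char)key[i];
        hash += hash << 10;
        hash ^= hash >> 6;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```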

You could also just throw memory at the problem. A table size of 500,000 gives you a minimum load factor of 40%, which could make up for shortcomings of your hash function. However, you'll soon reach a point of diminishing returns. That is, making the table size 1 million gives you a theoretical load factor of 20%, but it's almost certain that you won't actually realize that.

Long story short: use a better hash function and do some testing at different table sizes.

There is such a thing as a minimal perfect hash. If you know what your input data is (i.e., it doesn't change), then you can create a hash function that guarantees O(1) lookup. It's also very space efficient. However, I don't know how difficult it would be to create a minimal perfect hash for 200,000 items.

Jim Mischel
  • 131,090
  • 20
  • 188
  • 351