
Let the size of the hash table be static (I set it once). I want to set it according to the number of entries. My searching yielded that the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.

For simplicity, assume that the hash table will not accept any new entries and won't delete any.

The number of entries will be 200, 2000, 20000 and 2000000.

However, setting the size to 2*N seems like too much to me. Is it? Why or why not? If it is, what size should I pick?

I understand that we would like to avoid collisions. I also understand that maybe there is no such thing as an ideal size for a hash table, but I am looking for a starting point.

I am using C and I want to build my own structure, to educate myself.

gsamaras
  • I don't know the optimal size, but keep in mind that you will have hash collisions. I suspect that the ideal number of buckets will have something to do with the size of the hash space, and therefore probability of hash collisions. If N is large, 2*N seems an excessive use of memory. If N is small, hash collisions will be very rare, so 2*N also seems wasteful. – Eric J. Oct 13 '14 at 21:38
  • "for educating myself" - then experiment with a range of sizes and plot the consequent operations per second to understand the tradeoffs. – Tony Delroy Oct 14 '14 at 08:27
  • @TonyD yes, but I would like some guidance about the sizes I should pick. – gsamaras Oct 14 '14 at 10:21
  • Sure... `for (double load_factor = 0.5; load_factor <= 2.0; load_factor += 0.05) { size_t buckets = n / load_factor;` ...benchmark... `}`. If you don't chain colliding elements off your buckets (i.e. you're not using a list per bucket, instead having an algorithm to try alternative buckets), it will *fail* as it approaches/reaches 1.0; otherwise it can continue indefinitely, but I think you'll get a good indication of the curve from the range above. If you feel you still don't know where it's going, extend to 10 or 100 (perhaps increasing the step size to keep the runtime reasonable). – Tony Delroy Oct 14 '14 at 10:41 (a fleshed-out C sketch of this sweep follows these comments)
  • I will use lists per bucket to handle collisions, there is no doubt about that. I like your approach. – gsamaras Oct 14 '14 at 10:44
  • Just to clarify something: in my answer I didn't mean to disfavour the power of 2 size. Far from it: my point is that in general, you may as well use a power of 2, because a secondary hash function (built into the hash table implementation) can alleviate the problem of a poor hash function. However, the rationale of the relative performance of a division vs AND instruction is probably questionable on modern architectures. (You'd at least want to *demonstrate* it was a major performance factor rather than assuming it.) – Neil Coffey Oct 15 '14 at 00:11
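
To make the experiment Tony Delroy sketches above concrete, here is a minimal, self-contained C sketch of such a load-factor sweep. The toy table (integer keys, separate chaining, a simple multiplicative hash) and the clock()-based timing are illustrative assumptions, not part of the original comment; error checks are omitted for brevity.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node { unsigned key; struct node *next; };

    /* Knuth-style multiplicative hash, reduced to a bucket index. */
    static size_t index_of(unsigned key, size_t nbuckets)
    {
        return (key * 2654435761u) % nbuckets;
    }

    /* Build a chained table with the given bucket count, then time n lookups. */
    static double benchmark(size_t n, size_t nbuckets)
    {
        struct node **buckets = calloc(nbuckets, sizeof *buckets);
        struct node *nodes = malloc(n * sizeof *nodes);

        for (size_t i = 0; i < n; i++) {            /* insert keys 0..n-1 */
            size_t b = index_of((unsigned)i, nbuckets);
            nodes[i].key = (unsigned)i;
            nodes[i].next = buckets[b];
            buckets[b] = &nodes[i];
        }

        clock_t start = clock();                    /* time n successful lookups */
        volatile size_t hits = 0;
        for (size_t i = 0; i < n; i++) {
            for (struct node *p = buckets[index_of((unsigned)i, nbuckets)]; p; p = p->next)
                if (p->key == (unsigned)i) { hits++; break; }
        }
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

        free(nodes);
        free(buckets);
        return secs;
    }

    int main(void)
    {
        size_t n = 20000;   /* one of the entry counts from the question */
        for (double load = 0.5; load <= 2.0; load += 0.25) {
            size_t nbuckets = (size_t)(n / load);
            printf("load %.2f  buckets %zu  lookup time %.4fs\n",
                   load, nbuckets, benchmark(n, nbuckets));
        }
        return 0;
    }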

2 Answers


the size should be a prime number and equal to 2*N (the closest prime number I guess), where N is the number of entries.

It certainly shouldn't. Probably this recommendation implies that a load factor of 0.5 is a good tradeoff, at least by default.

As for the primality of the size, it depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic hashing); others don't, and they can benefit from a table size that is a power of 2, because it allows very cheap modulo operations. However, when the closest "available" table sizes differ by a factor of 2, the memory usage of the hash table might be unreliable. So, even when using linear hashing or separate chaining, you can choose a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size in particular, because:

If you pick a prime table size (either because the algorithm requires it, or because you are not satisfied with the memory usage unreliability implied by a power-of-2 size), the table slot computation (modulo by the table size) can be combined with the hashing. See this answer for more.
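
As a small illustration of what the two size choices mean for the bucket-index computation (a minimal sketch; the function names and the example prime are mine, not from the answer):

    #include <stddef.h>
    #include <stdint.h>

    /* Power-of-2 table size: reducing the hash to a bucket index
     * is a single AND with (size - 1). */
    static size_t bucket_pow2(uint64_t hash, size_t size)   /* size must be a power of 2 */
    {
        return (size_t)(hash & (size - 1));
    }

    /* Prime table size: a genuine integer modulo is needed,
     * e.g. size = 40009, a prime just above 2 * 20000. */
    static size_t bucket_prime(uint64_t hash, size_t size)
    {
        return (size_t)(hash % size);
    }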

The point that a power-of-2 table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
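
For reference, the avalanching step the answer alludes to looks like the 64-bit finalizer from MurmurHash3; combining it with a power-of-2 mask, as sketched below, is my illustration of the argument, not code from the answer:

    #include <stddef.h>
    #include <stdint.h>

    /* MurmurHash3's 64-bit finalizer ("fmix64"): a handful of multiplies
     * and shifts that spread every input bit across the whole word. */
    static uint64_t fmix64(uint64_t h)
    {
        h ^= h >> 33;
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return h;
    }

    /* After avalanching even a weak hash, a power-of-2 table can be
     * indexed with a cheap mask instead of a division by a prime. */
    static size_t slot(uint64_t weak_hash, size_t pow2_size)
    {
        return (size_t)(fmix64(weak_hash) & (pow2_size - 1));
    }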

The entries will be 200, 2000, 20000 and 2000000.

I don't understand what you meant by this.

However, setting the size to 2*N seems like too much to me. Is it? Why or why not? If it is, what size should I pick?

The general rule is called the space-time tradeoff: the more memory you allocate for the hash table, the faster the hash table operates. Here you can find some charts illustrating this. So, if you think that by choosing a table size of ~ 2 * N you would waste memory, you can freely choose a smaller size, but be ready for operations on the hash table to become slower on average.
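
To put rough numbers on that tradeoff for the largest case in the question (my arithmetic, assuming a separate-chaining table whose bucket array holds one 8-byte pointer per bucket): with N = 2,000,000 entries, a table of ~2 * N = 4,000,000 buckets costs about 32 MB for the bucket array alone at an average chain length of 0.5, while shrinking it to N = 2,000,000 buckets halves that to about 16 MB at the price of an average chain length of 1.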

I understand that we would like to avoid collisions. I also understand that maybe there is no such thing as an ideal size for a hash table, but I am looking for a starting point.

It's impossible to avoid collisions completely (remember the birthday paradox? :) A certain ratio of collisions is an ordinary situation. This ratio affects only the average operation speed; see the previous section.
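
For a feel of how unavoidable collisions are (my numbers, using the standard birthday-paradox approximation): hashing the question's smallest case of N = 200 keys uniformly into M = 2 * N = 400 buckets gives a collision probability of about 1 - exp(-N(N-1) / (2M)) = 1 - exp(-49.75), which is indistinguishable from 1.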

leventov
  • When collisions occur, I will simply have a list in the bucket and search it via brute force. "The entries will be 200, 2000..." means that the number of entries will be 200, etc. I understand the trade-off. However, like the other answer, you also don't suggest a size for my hash table. Is it because I should first pick the hash function? – gsamaras Oct 14 '14 at 10:19
  • @G.Samaras practically, the first pick by default should be a power-of-2 size closest to 2 * N. Use a prime size only if you really need better control over memory usage. – leventov Oct 14 '14 at 12:46
  • There was an interesting side note in the linked article with performance graphs. Once the memory allocated for the hash table is larger than CPU cache, there was no benefit to additional memory size because algorithmic gains tended to be offset by CPU cache misses. – Eric J. Oct 14 '14 at 15:25

The answer to your question depends somewhat on the quality of your hash function. If you have a good quality hash function (i.e. one where on average, the bits of the hash code will be "distributed evenly"), then:

  • the necessity to have a prime number of buckets disappears;
  • you can expect the number of items per bucket to be Poisson distributed.
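
To make the Poisson point above concrete (my numbers): with N items in M buckets, i.e. load factor λ = N/M, a good hash function gives an expected fraction of roughly e^(-λ) · λ^k / k! of buckets holding exactly k items. At λ = 1 (as many buckets as items) that works out to about 37% of buckets empty, 37% with one item, and 26% with two or more, which is why "exactly one item per bucket" is unattainable even with a perfect hash.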

So firstly, the advice to use a prime number of buckets is essentially a kludge to help alleviate situations where you have a poor hash function. Provided that you have a good quality hash function, it's not clear that there are really any constraints per se on the number of buckets, and one common choice is to use a power of two so that the modulo is just a bitwise AND (though either way, it's not crucial nowadays). A good hash table implementation will include a secondary hash to try and alleviate the situation where the original hash function is of poor quality-- see the source code to Java's HashTable for an example.
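
As a minimal sketch of where such a secondary hash sits in a chained table's lookup path (the struct layout is my own illustration, and the spread() function below is modelled loosely on the supplemental hash used by older java.util.HashMap versions, not taken from the answer):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct entry {
        uint32_t hash;          /* cached primary hash of the key */
        const char *key;
        void *value;
        struct entry *next;     /* separate chaining */
    };

    struct table {
        struct entry **buckets;
        size_t nbuckets;        /* a power of 2 */
    };

    /* Secondary ("supplemental") hash: folds the high bits into the low
     * bits so the power-of-2 mask below doesn't see only the low bits. */
    static uint32_t spread(uint32_t h)
    {
        h ^= (h >> 20) ^ (h >> 12);
        return h ^ (h >> 7) ^ (h >> 4);
    }

    static struct entry *lookup(const struct table *t, const char *key,
                                uint32_t primary_hash)
    {
        size_t b = spread(primary_hash) & (t->nbuckets - 1);
        for (struct entry *e = t->buckets[b]; e != NULL; e = e->next)
            if (e->hash == primary_hash && strcmp(e->key, key) == 0)
                return e;
        return NULL;
    }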

A common load factor is 0.75 (i.e. you have 100 buckets for every 75 entries). This translates to roughly half of the entries ending up alone in their bucket-- so it's good performance-wise-- though of course it also wastes some space. What the "correct" load factor is for you depends on the time/space tradeoff that you want to make.
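
For instance (my arithmetic): at a load factor of 0.75 the Poisson model gives e^(-0.75) ≈ 47% of entries sitting alone in their bucket, which is where the "roughly half" above comes from. For the entry counts in the question, a 0.75 load factor means about 267, 2,667, 26,667 and 2,666,667 buckets respectively, which you might round up to the next power of two (512, 4,096, 32,768 and 4,194,304) if you index with a mask.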

In very high-performance applications, a potential design consideration is also how you actually organise the structure/buckets in memory to maximise CPU cache performance. (The answer to what is the "best" structure is essentially "the one that performs best in your experiments with your data".)

Neil Coffey
  • I haven't decided what my hash function will be. I think the 1st link can help me decide. I partially understand your answer. For example, if we had a perfect hash function, then the size of the hash table should be N (one item per bucket). Does the Poisson distribution have to do with how the items are going to be inserted? I mean, does it have to do with the chance that an item will be inserted into a certain bucket? As for the load factor, will I have to somehow put that in the equation too? If you were me, how would you set the size of the table? I am not interested right now in using a secondary table. – gsamaras Oct 13 '14 at 22:08
  • No, a "perfect" hash function is one that effectively allows us to consider "100 random numbers" to represent 100 randomly chosen entries. So imagine that you have 100 buckets and you pick 100 random numbers in the range 1-100, i.e. a thought experiment equivalent to having the same number of buckets as items. In reality, after picking 100 random numbers, you wouldn't expect each number exactly once; rather there'll be some numbers that don't come up at all, and others that are duplicated 2 or 3 times... – Neil Coffey Oct 13 '14 at 22:15
  • ...so the "ideal" of having every single bucket filled exactly once with no "wastage" simply isn't a realistic goal to cope with a general case. This isn't to do with your hash function being imperfect; rather, the situation that you statistically expect with a "perfect" hash function is to actually have some wastage if you want to avoid having many items in any given bucket. – Neil Coffey Oct 13 '14 at 22:18
  • Actually a good programming exercise you could try is the simulation I mention with random numbers: write a program that takes (say) 1000 random numbers in the range 1-1000 and counts how many times each number occurs and thereby look at the distribution of "bucket lengths". Then repeat with 1000 numbers in the range 1-500 (simulating 1000 items in 500 buckets) and look at the distribution. Then 1000 numbers in the range 1-2000 etc. – Neil Coffey Oct 13 '14 at 22:21 (a C sketch of this simulation follows these comments)
  • I could do that. So, I should start with finding a nice hash function first and then try to think about the size of the hash table. – gsamaras Oct 13 '14 at 22:25
  • Yes, although you also don't need to be *super* paranoid about the hash function. If you can whack the data to be hashed into a buffer, then you can use something like this: http://www.javamex.com/tutorials/collections/strong_hash_code_implementation.shtml – Neil Coffey Oct 14 '14 at 01:33
  • @NeilCoffey "Table size primality is driven by quality of hash function" is wrong premise I think. See my answer. – leventov Oct 14 '14 at 06:06
  • Both answers are nice, but I am accepting the other one, since the guy has less reputation. – gsamaras Oct 14 '14 at 16:42
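
Following up on that exercise, here is a minimal C sketch of the simulation described in the comments above (1000 keys into 2000, 1000 and 500 buckets). The use of rand() and the reporting format are illustrative choices, not part of the comment:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define MAX_CHAIN 16   /* longest chain length we bother to report */

    /* Throw nkeys random keys into nbuckets buckets and print how many
     * buckets end up with 0, 1, 2, ... keys. */
    static void simulate(int nkeys, int nbuckets)
    {
        int *count = calloc(nbuckets, sizeof *count);   /* keys per bucket */
        int histogram[MAX_CHAIN + 1] = {0};             /* buckets per chain length */

        for (int i = 0; i < nkeys; i++)
            count[rand() % nbuckets]++;

        for (int b = 0; b < nbuckets; b++)
            histogram[count[b] > MAX_CHAIN ? MAX_CHAIN : count[b]]++;

        printf("%d keys into %d buckets:\n", nkeys, nbuckets);
        for (int k = 0; k <= MAX_CHAIN; k++)
            if (histogram[k])
                printf("  %2d key(s): %d buckets\n", k, histogram[k]);

        free(count);
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        simulate(1000, 2000);   /* load factor 0.5 */
        simulate(1000, 1000);   /* load factor 1.0 */
        simulate(1000, 500);    /* load factor 2.0 */
        return 0;
    }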