56

Suppose simple uniform hashing, that is, any given value is equally likely to hash into any of the slots of the table. Why is it better to use a table of size 127 and not 128? I really don't understand what the problem with power-of-2 sizes is, or how it actually makes any difference at all.

When using the division method, we usually avoid certain values of m (table size). For example, m should not be a power of 2, since if m = 2^p, then h(k) is just the p lowest-order bits of k.

Let's suppose the possible elements are only between 1 and 10000 and I picked the table size as 128. How can 127 be better? So 128 is 2^7 (10000000) and 127 is 01111111. What difference does this make? All numbers (when hashed) are still going to be the p lowest-order bits of k for 127 too. Did I get something wrong?

I'm looking for some examples, as I really can't understand why this is bad. Thanks a lot in advance!

PS: I am aware of: Hash table: why size should be prime?

Clash
  • 2
    `> PS: I am aware of: Hash table: why size should be prime?` - then read it again, or link through to [this one](http://stackoverflow.com/questions/1145217/why-should-hash-functions-use-a-prime-number-modulus/1147232#1147232) – sehe May 08 '11 at 20:25
  • 1
    @sehe The thread you linked makes a supposition that the elements inside have a relationship ("Then if a bunch of strings all having the same first char are fed in, then the results will all be the same modulo k") – Clash May 08 '11 at 21:23
  • @Clash: Sorry, but if you insist that it is not necessary to optimize against collisions for your specific hash, you might be confusing indexing with hashing. A perfect hash can be used as an index, but all possible values have to be known up front. With such a configuration it doesn't matter even if the number of buckets is actually a factorial (`n!`). But that is not the generic science behind hashing. – sehe May 08 '11 at 21:31
  • 3
    OT: `Clash` is a very nice screen name to use when talking about hash collisions :) – sehe May 08 '11 at 21:31
  • @sehe, I'm not insisting that I don't have collisions. I'm just trying to understand why a prime, even though smaller than a power of 2, is better than a power of two. The link you gave me refers to a situation where a particular set of elements is more likely to occur. Thanks for your replies! – Clash May 09 '11 at 07:10
  • possible duplicate of [Why setting HashTable's length to a Prime Number is a good practice ?](http://stackoverflow.com/questions/5152015/why-setting-hashtables-length-to-a-prime-number-is-a-good-practice) – Brian May 09 '11 at 14:16
  • 1
    Because real data is almost never uniformly distributed. If you hash strings using 128, you'll get 26 buckets filled unevenly and the rest empty. If you use 127 you'll probably get them all filled more evenly. – phkahler May 09 '11 at 19:36

9 Answers

22

All numbers (when hashed) are still going to be the p lowest-order bits of k for 127 too.

That is wrong (or I misunderstood..). k % 127 depends on all bits of k. k % 128 only depends on the 7 lowest bits.


EDIT:

If you have a perfect distribution between 1 and 10,000, then reducing it mod 127 or mod 128 will both turn it into an excellent smaller distribution. All buckets will contain about 10,000 / 128 = 78 (or 79) items.

If you have a distribution between 1 and 10,000 that is biased because {x, 2x, 3x, ...} occur more often, then a prime size will give a much, much better distribution, as explained in this answer. (Unless x is exactly that prime size.)

Thus, cutting off the high bits (using a size of 128) is no problem whatsoever if the distribution in the lower bits is good enough. But with real data and really badly designed hash functions, you will need those high bits.
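To make this concrete, here is a minimal C++ sketch (the specific keys are just illustrative) showing that keys which differ only in their high bits always collide with a table size of 128, but generally not with 127:

```cpp
#include <cstdio>

// Keys that share their 7 low-order bits always collide when m = 128,
// because k % 128 keeps only those 7 bits. With m = 127 the higher bits
// still influence the bucket, so the same keys spread out.
int main() {
    const unsigned keys[] = {91, 91 + 128, 91 + 256, 91 + 1024};
    for (unsigned k : keys)
        std::printf("k = %4u   k %% 128 = %3u   k %% 127 = %3u\n",
                    k, k % 128, k % 127);
    return 0;
}
```

All four keys land in bucket 91 with m = 128; with m = 127 they end up in different buckets.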

Ishtar
  • You are right Ishtar. But this is equivalent to saying that any multiple of 128 % 128 (the higher order bits are always multiples of 128) is going to be 0, which for me, is obvious. 127 on the other hand does not have this property, but there will be even more multiples of 127, so this should be even worse, shouldn't it? I don't understand the problem with ignoring the higher bits. – Clash May 08 '11 at 21:14
  • 1
    @Clash - The real problem in ignoring the higher bits is that people write lousy hash functions. So if your table needs a good distribution it would be silly to ignore those extra bits of effort. Making good hashes is hard, so with a prime size you're just being tolerant. – Ishtar May 08 '11 at 21:29
  • 2
    @Clash: the problem with ignoring the higher bits is that it's normal for a given data set to vary in only some bits. (E.g., a bunch of string variables representing paths might agree on the first dozen characters. Or, ages might agree on all but the bottom 6 bits.) If those are the bits you're throwing out, you're going to have a lot of collisions. – Neil G May 08 '11 at 23:35
5

Division Method

"When using the division method, we usually avoid certain values of m (table size). For example, m should not be a power of 2, since if m = 2p , then h(k) is just the p lowest-order bits of k."

--CLRS

To understand why m = 2^p uses only the p lowest bits of k, you must first understand the modulo hash function h(k) = k % m.

The key k can be written in terms of a quotient n and a remainder r:

k = nq + r

Choosing the divisor to be q = m allows us to write k % m simply as the remainder in the above equation:

k % m = r = k - nm,  where r < m

Therefore, k % m is equivalent to repeatedly subtracting m a total of n times (until r < m):

k % m = k - m - m - ... - m,  until r < m

Let's try hashing the key k = 91 with m = 2^4 = 16.

  91 = 0101 1011
- 16 = 0001 0000
----------------
  75 = 0100 1011
- 16 = 0001 0000
----------------
  59 = 0011 1011
- 16 = 0001 0000
----------------
  43 = 0010 1011
- 16 = 0001 0000
----------------
  27 = 0001 1011
- 16 = 0001 0000
----------------
  11 = 0000 1011

Thus, 91 % 2^4 = 11 is just the binary form of 91 with only the p = 4 lowest-order bits remaining.
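The same fact in code, assuming unsigned keys: for m = 2^p the modulo is just a bit mask, so the whole subtraction above collapses to k & (m - 1):

```cpp
#include <cassert>
#include <cstdio>

// For unsigned k and m a power of two, k % m equals k & (m - 1):
// the mask keeps exactly the p low-order bits, as in the example above.
int main() {
    unsigned k = 91, m = 16;            // m = 2^4, as in the worked example
    unsigned by_mod  = k % m;           // 11
    unsigned by_mask = k & (m - 1);     // 0101 1011 & 0000 1111 = 0000 1011
    assert(by_mod == by_mask);
    std::printf("91 %% 16 = %u, 91 & 15 = %u\n", by_mod, by_mask);
    return 0;
}
```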


Important Distinction:

This pertains specifically to the division method of hashing. In fact, the opposite is true for the multiplication method, as stated in CLRS:

"An advantage of the multiplication method is that the value of m is not critical... We typically choose [m] to be a power of 2 since we can then easily implement the function on most computers."

bcorso
3

First off, it's not about picking a prime number. For your example, if you know your data set will be in the range 1 to 10,000, picking 127 or 128 won't make a difference because either one is a poor design choice.

Rather, it's better to pick a really large prime like 3967 for your example so that each element will have its own unique key/value pair. You just also want to minimize collisions. Picking 127 or 128 for your example won't make a difference because all 127/128 buckets will be uniformly filled (this is bad and will degrade the O(1) insertion and lookup run time to O(n)), as opposed to 3967 (which will preserve the O(1) run times).

EDIT #4

The design of the "hash function" is somewhat of a black art. It can be highly influenced by the data that's intended to be stored in the hashing-based data structure, so the discussion on a sensible hashing function can often stray into a discussion about specific inputs.

As for why primes are "preferred", one has to consider an "adversary" analysis; that is, suppose I designed a general hashing-based data structure, how would it perform given the worst input from an adversary? Since performance is dictated by hashing collisions, the question becomes which hash minimizes collisions in the worst conditions. One such condition is when the inputs are always numbers divisible by some integer, say 4. If you use N = 128, then any number divisible by 4 is still divisible by 4 mod 128, which means only buckets 0, 4, 8, 12, ... are ever used, resulting in 25% utilization of the data structure. Primes effectively reduce the likelihood of such a scenario occurring once the keys exceed N.
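A small sketch of that adversarial case: feed in only multiples of 4 and count how many buckets are ever touched for m = 128 versus m = 127 (the key range 1 to 10,000 is borrowed from the question):

```cpp
#include <cstdio>
#include <set>

// Adversarial input: every key is a multiple of 4. Count how many distinct
// buckets are ever used for table sizes 128 and 127.
int main() {
    const unsigned sizes[] = {128, 127};
    for (unsigned m : sizes) {
        std::set<unsigned> used;
        for (unsigned k = 4; k <= 10000; k += 4)
            used.insert(k % m);
        std::printf("m = %3u: %zu of %u buckets ever used\n",
                    m, used.size(), m);   // 32 of 128 vs. all 127 of 127
    }
    return 0;
}
```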

Matthew
  • Correct me if I'm wrong, but 3976 will have multiple values in each bucket. – Nick ODell May 08 '11 at 20:11
  • @Nick I think he read 1000. I know 127 and 128 are bad for 10000. What I want to understand is, why is it better to take a prime and not just any other number? Why is a power of 2 bad? Say then I picked 16384 (2^14). Why is 16381 better? Thanks – Clash May 08 '11 at 20:15
  • 1
    Sorry, typoe: i meant 3967. Well, it goes back to the design of the hash function. For now, if you assume a bare, hash function that just takes a number (btwn 1 and 10,000) and takes it modulo by 3967 it practically ensures we have zero collisions in the table. Also the large prime makes our table almost 4x larger and will ensure that collisions have a low probability of happening – Matthew May 08 '11 at 20:15
  • Do you mean that the table size should be bigger than the range of possible hash values? A size of 127/128 with a hash function producing values 1-10,000 is fine if you store 50 elements. – Ishtar May 08 '11 at 20:40
  • 2
    I don't see why 127 is "small" and 3967 is "really big". All that matters is the _load factor_. If you're storing 10 elements, 127 is perfectly fine, and will probably incur fewer cache misses. – Neil G May 08 '11 at 20:46
  • I'm not going to go into load factor/balancing, but i do want to add more about why we use primes instead of power of 2's for hashing (discussed with a friend): the reason we avoid a power of 2 is that hash functions are ultimately returning a binary result and if you use a power of two, you're basically chopping off the top half of the hash since binary is a power of two. Ideally you want every element in the hash value to play a significant role in the hash position, so by picking a prime number (or at least a # that isn't a power of two), you're doing a better job of using the whole hash. – Matthew May 08 '11 at 21:25
  • 1
    @mattkc7, what do you mean with "binary is a power of two"? I thought binary was simply another base for representing numbers. I also don't see how half of the hash is getting chopped off when a power of two is used. – Clash May 08 '11 at 21:31
  • @mattkc7: I don't buy that explanation. If you mod by 127, you're still not using the whole hash. You're losing all but (just less than) 7 bits of information. It's just a lot harder to see which bits of information you're losing. Do you think that modding by 3 is better than modding by 4? – Neil G May 09 '11 at 08:47
  • @Neil G + everyone else: when i first read this question, i thought it was more about "why primes" as opposed to "what's wrong with power's of 2's?" so i checked with friends at UW and UC Berkeley and present to you the final edit (done to the best of my abilities) i'm making above. I hope it helps clarifies why powers of 2's are not great candidates compared to primes! – Matthew May 10 '11 at 03:04
3

Nick is right that in general, the hash table size doesn't matter. However, in the special case where open addressing with double hashing is used (in which the interval between probes is computed by another hash function) then a prime number-sized hash table is best to ensure that all hash table entries are available for a new element (as Corkscreewe mentioned.)
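A quick sketch of why this matters, with made-up h1/h2 values: the i-th probe in double hashing is (h1 + i*h2) mod m, and the probe sequence covers every slot only when the step h2 is coprime with m, which a prime m guarantees for any nonzero step:

```cpp
#include <cstdio>
#include <set>

// Double hashing probes slot (h1 + i*h2) % m on the i-th attempt.
// Count how many distinct slots a given step h2 can ever reach.
unsigned slots_reachable(unsigned m, unsigned h1, unsigned h2) {
    std::set<unsigned> seen;
    for (unsigned i = 0; i < m; ++i)
        seen.insert((h1 + i * h2) % m);
    return (unsigned)seen.size();
}

int main() {
    // With a prime m every nonzero step is coprime with m, so all slots are
    // reachable; with m = 128 an even step reaches only a fraction of them.
    std::printf("m = 127, step 6: %u slots\n", slots_reachable(127, 3, 6));  // 127
    std::printf("m = 128, step 6: %u slots\n", slots_reachable(128, 3, 6));  //  64
    return 0;
}
```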

Neil G
2

If you have a perfect hash function that has an even distribution, then it doesn't matter.

Nick ODell
  • 3
    If you don't, it might happen that a recursive collision appears, thus making it impossible for a certain item to be saved in the hashtable. With prime number size (or perfect hash function), this will not occur. – Corkscreewe May 08 '11 at 20:05
  • 3
    That would really depend on what the table does on a collision. – Nick ODell May 08 '11 at 20:08
  • My hash function is the modulus operator. This is not a perfect hash, is it? I actually haven't reached perfect hashing yet, but from what I have read it has more to do with the fact that no new keys are being inserted; the elements are static. – Clash May 08 '11 at 20:31
  • @Clash, that's a pretty bad hash function if you are modding by the table size because higher-order bits are not used in the hash function. Why don't you just copy std::hash? – Neil G May 08 '11 at 20:38
  • @Neil, this is what I'm trying to understand. How is using a prime that's close to a power of two, or any other number near a power of two, better than a power of two? BTW: There is no hash in std as far as I know. There is std::map, but I think inside it works as a binary tree (could be wrong) – Clash May 08 '11 at 20:58
  • @Clash: Did you see my answer? Also, std::hash was added in C++0x, so you can find the code online if you don't already have it installed. – Neil G May 08 '11 at 21:05
  • I did see your answer. The book is using the modulus operator as a hash function for this example, not double hashing. – Clash May 08 '11 at 21:17
  • @Clash: Double hashing has nothing to do with the hash function; it has to do with how collisions are resolved. That is part of your implementation of the hash table. It might be helpful to read the Wikipedia article on hash tables. – Neil G May 08 '11 at 21:55
  • @Neil G, Thanks for the explanation, but it's not using double hashing either, the slot to be used is the result of the hash function, that is simply the modulus operator – Clash May 09 '11 at 07:11
2

Wikipedia actually has a good summary of this:

http://en.wikipedia.org/wiki/Hash_table

They point out that some hash functions are designed to operate ONLY with prime numbers. This article explains why powers of two are bad:

http://www.concentric.net/~Ttwang/tech/primehash.htm

0

I believe that it just has to do with the fact that computers work in base 2. Something similar happens with base 10.

...

Picking a big enough, non-power-of-two number will make sure the hash function really is a function of all the input bits, rather than a subset of them.

From Why hash tables should use a prime-number size.

Ste_95
0

Here is a way to understand why "k % 127 depends on all bits of k, while k % 128 only depends on the 7 lowest bits."

k % 128 is equal to k & (2^7 - 1). For example, 129 % 128 = 1; in binary, 1000 0001 & 0111 1111 = 0000 0001. Every bit of (2^7 - 1) above the 7 low-order positions is 0, so it does not matter what the high bits of k are. But this shortcut is only valid when the modulus is equal to 2^n.

Now let's look at how we do the division 129 % 127 in decimal. First look at the highest digit, 1: it is less than 127, so take the next digit, 2, and combine it with the first to get 12. That is still less than 127, so combine it with the 9, giving 129. Divided by 127, the remainder is 2; in math, 129 = 1 * 127 + 2, so we get 2. (All of this is just long division.) Binary division works the same way, so now we know that k % 127 depends on all bits of k.

paxi
0

I cannot prove it anymore, although I remember having to do so in an exam at university a million years ago, but optimal hash sizes are not merely prime. You want to pick a prime number N such that N = 4*M − 1 (where M is also an integer).

That makes 31 a better number of buckets than 29. M is 8 when N is 31, but there is no integral M when N is 29.

As I said, I no longer remember the math to prove this. It was in a theory course taught by Rachel Manber, Udi’s wife, about 25 years ago or so.

tchrist