11

Checking Java and searching online for hash table code examples, it seems that resizing of the table is done by doubling it.
But most textbooks say that the best size for the table is a prime number.
So my question is:
Is the approach of doubling because:

  1. It is easy to implement, or
  2. Finding a prime number is too inefficient (though I think that finding the next prime by going over n+=2 and testing for primality using modulo is O(loglogN), which is cheap; see the sketch after this list), or
  3. Is this my misunderstanding, and only certain hash table variants require a prime table size?
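
For reference, here is a minimal sketch of the kind of next-prime search I mean in point 2. The `nextPrime`/`isPrime` helpers are hypothetical names for illustration, not taken from any real implementation:

```java
// Hypothetical sketch: find the next prime >= n by stepping over odd
// candidates and trial-dividing. Not from any real hash table.
static int nextPrime(int n) {
    if (n <= 2) return 2;
    if (n % 2 == 0) n++;                 // consider only odd candidates
    while (!isPrime(n)) n += 2;
    return n;
}

static boolean isPrime(int n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (int d = 3; (long) d * d <= n; d += 2) {
        if (n % d == 0) return false;
    }
    return true;
}
```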

Update:
The prime-number approach presented in textbooks is required for certain properties to hold (e.g. quadratic probing needs a prime-sized table to prove that, for example, if the table is not full, item X will be inserted).
The link posted as a duplicate asks generally about increasing by any amount, e.g. by 25% or to the next prime, and the accepted answer states that we double in order to keep the resizing operation "rare" so that we can guarantee amortized time.
This does not answer the question of having a table size that is prime and resizing to a prime that is even greater than double. So the idea is to keep the properties of the prime size while taking the resizing overhead into account.

Jim
  • 1
    There's also a good discussion of it at http://stackoverflow.com/a/1147232/1076640. Focus especially on the part that includes, "So you rely on the hash function not to use even multipliers." – yshavit May 21 '15 at 19:39
  • http://stackoverflow.com/a/2386132/139010 – Matt Ball May 21 '15 at 19:39
  • I don't understand: which implementations double it? If it's a random example online, that makes sense; you're not going to get the most efficient and technical example. – andre May 21 '15 at 19:39
  • @andre http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/java/util/HashMap.java#HashMap.resize%28%29 – Sotirios Delimanolis May 21 '15 at 19:41
  • @andre java.util.HashMap doubles, for instance: http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#772 – yshavit May 21 '15 at 19:41
  • @SotiriosDelimanolis:Did you even read my question before marking it as duplicate? – Jim May 21 '15 at 19:59
  • Please edit your question to demonstrate why you think it isn't a duplicate. I'll review, so will others, and we can reopen it. – Sotirios Delimanolis May 21 '15 at 20:03
  • @andre:Well search for any language and an example of an implementation and you will see – Jim May 21 '15 at 20:30
  • Forget the question in the linked duplicate. Does the accepted answer address your question about the reasoning behind the size doubling? Your last paragraph in the edit seems to indicate that your question is about how we can improve hashing data structures by changing the method we use to increase its size. Your original question, to me, was why do we double. It seems to me the linked duplicate addresses that. – Sotirios Delimanolis May 21 '15 at 20:46
  • @SotiriosDelimanolis:The accepted answer from my point of view is irrelevant. It addresses the fact that in order to keep a constant amortized insert time we must avoid frequent resizes of the table which is essentially a linear operation and hence prefer to double instead of going by constant increments (even prime). My question is related to the actual size before or after resizing. Since we can also keep the resizing a "rare" operation by choosing large primes (greater than doubling) – Jim May 21 '15 at 20:55
  • Your question is more unclear to me now than before. I've reopened. I'll let others decide. – Sotirios Delimanolis May 21 '15 at 21:10
  • 1
    Lookups when your table is sized by a power of 2 are faster because remainder can be done with a bitmask, but that's more of a micro optimization. – David Ehrmann May 21 '15 at 21:17
  • Can you provide the reference you refer to that asks for the hash table size to be a prime number? As far as I remember, this requirement comes from the hash function, not the table. In Cormen's *Introduction to Algorithms*, the only place where he seems to suggest using a prime number for the table's size is as an alternative to doing it in the hash function itself. – amit May 21 '15 at 21:47
  • And Java does not implement an open addressing hash table... – amit May 21 '15 at 21:52
  • @amit: It is not clear to me whether all variations need a prime size or only certain ones. E.g. quadratic probing needs a prime size (check any reference, even the wiki) in order to work; otherwise an item may fail to be inserted despite there being empty space in the table. For linear probing it seems not to matter, and for other variations I am unclear, to be honest. So I am not sure if only the variations that do not require a prime table size are implemented in practice (e.g. external chaining) – Jim May 21 '15 at 21:56
  • @Jim [Java implements external chaining](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.get%28java.lang.Object%29). Reading Cormen's section on hash tables, he keeps saying that using a power of 2 is done in practice, and that there is no problem with it if the hash function `h(x)` does a modulus by a prime number relative to the table's size (a power of 2) – amit May 21 '15 at 21:59
  • Also, I am not a bit manipulation expert, but I assume that is what [`hash(int)`](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.hash%28int%29) does. – amit May 21 '15 at 22:01
  • @amit:Yes there is no problem if e.g. you implement external chaining. There is a problem if you implement quadratic probing. – Jim May 21 '15 at 22:01
  • 1
    And java's hash table is implemented as external chaining, so there is no problem. I am not following the question. – amit May 21 '15 at 22:04
  • @amit:Is that what `hash(int)` does? Very interesting. I didn't know we can figure out a prime number just by shifting – Jim May 21 '15 at 22:06
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/78475/discussion-between-amit-and-jim). – amit May 21 '15 at 22:06
  • 1
    What we should bear in mind is that built-in Java collections are all based on a compromise to a certain extent: they must work reasonably well for an incredibly wide spectrum of usage patterns. In an application you can get away with re-implementing collections using algorithms that are better suited for your particular use case at the expense of performing worse in other situations. Which is what a lot of people have done. – biziclop May 22 '15 at 11:50
  • @amit: Did you have time to check out what we discussed? I am interested in your opinion – Jim May 25 '15 at 18:40
  • @Jim Didn't find a digital copy of this book and did not have time to get to the library to see a hard copy, sorry. – amit May 25 '15 at 19:12

2 Answers

6

Q: But most textbooks say that the best size for the table is a prime number.

Regarding size primality:

As for the primality of the size: it depends on the collision resolution algorithm you choose. Some algorithms require a prime table size (double hashing, quadratic probing); others don't, and they can benefit from a power-of-2 table size, because it allows very cheap modulo operations. However, when the closest "available table sizes" differ by a factor of 2, the memory usage of the hash table might be unreliable. So, even when using linear probing or separate chaining, you can choose a non-power-of-2 size. In that case, in turn, it's worth choosing a prime size in particular, because:

If you pick a prime table size (either because the algorithm requires this, or because you are not satisfied with the memory usage unreliability implied by a power-of-2 size), the table slot computation (modulo by table size) can be combined with the hashing itself. See this answer for more.
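
To make the cost difference concrete, here is a small comparison sketch (illustrative only; the method names are made up):

```java
// Power-of-2 table: the slot is a single bitwise AND.
static int slotPowerOf2(int hash, int capacityPowerOf2) {
    return hash & (capacityPowerOf2 - 1);
}

// Prime-sized (or any non-power-of-2) table: the slot needs an integer
// division/modulo, which is slower on modern CPUs.
static int slotPrime(int hash, int primeCapacity) {
    return Math.floorMod(hash, primeCapacity);  // floorMod keeps the result non-negative
}
```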

The point that a power-of-2 table size is undesirable when the hash function distribution is bad (from the answer by Neil Coffey) is impractical, because even if you have a bad hash function, avalanching it and still using a power-of-2 size would be faster than switching to a prime table size: a single integer division is still slower on modern CPUs than the several multiplications and shift operations required by good avalanching functions, e.g. from MurmurHash3.
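
For reference, the 32-bit finalizer ("avalanche") step of MurmurHash3 is only a handful of multiply and xor-shift operations, roughly:

```java
// MurmurHash3 32-bit finalizer (fmix32): mixes every input bit into
// every output bit, so even the low bits used by a power-of-2 mask
// are well distributed.
static int avalanche(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    h ^= h >>> 16;
    return h;
}
```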


Q: Also to be honest I got lost a bit on if you actually recommend primes or not. Seems that it depends on the hash table variant and the quality of the hash function?

  1. The quality of the hash function doesn't matter: you can always "improve" the hash function with MurmurHash3 avalanching, which is cheaper than switching from a power-of-2 table size to a prime table size; see above.

  2. I recommend choosing a prime size, with the QHash or quadratic hashing algorithm (they aren't the same), only when you need precise control over the hash table load factor and predictably high actual loads. With a power-of-2 table size, the minimum resize factor is 2, and generally we cannot guarantee that the hash table will have an actual load factor higher than 0.5. See this answer.

    Otherwise, I recommend going with a power-of-2-sized hash table with linear probing.

Q: Is the approach of doubling because:
It is easy to implement, or

Basically, in many cases, yes. See this large answer regarding load factors:

Load factor is not an essential part of the hash table data structure -- it is a way to define rules of behaviour for the dynamic system (a growing/shrinking hash table is a dynamic system).

Moreover, in my opinion, in 95% of modern hash table cases this approach is oversimplified and the dynamic system behaves suboptimally.

What is doubling? It's just the simplest resizing strategy. The strategy could be arbitrarily complex, performing optimally for your use cases. It could consider the present hash table size, growth intensity (how many get operations were done since the previous resize), etc. Nobody forbids you from implementing such custom resizing logic.
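
As a toy illustration of that point, the resize decision could be pulled out behind an interface (purely hypothetical, not the API of any real library):

```java
// Hypothetical hook (not an API of any real library): the table asks the
// policy for its next capacity instead of hard-coding "double at 0.75 load".
interface ResizePolicy {
    // Return the new capacity, or currentCapacity to skip resizing.
    int nextCapacity(int currentCapacity, int size, long opsSinceLastResize);
}

// The usual doubling strategy expressed as one such policy.
static final ResizePolicy DOUBLING =
        (capacity, size, ops) -> size > capacity * 3 / 4 ? capacity * 2 : capacity;
```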

Q: Finding a prime number is too inefficient? (But I think that finding the next prime by going over n+=2 and testing for primality using modulo is O(loglogN), which is cheap.)

It is good practice to precompute some subset of prime hash table sizes and choose between them using binary search at runtime. See the list of double hash capacities and explanation, and the QHash capacities. Or even use a direct lookup, which is very fast.
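
A rough sketch of that idea (the capacities below are the classic "roughly doubling" hash table primes, used here only for illustration; the real tables linked above are larger and more carefully tuned):

```java
import java.util.Arrays;

// Illustrative table of precomputed prime capacities, each roughly
// double the previous one.
static final int[] PRIME_CAPACITIES = {
    53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593
};

// Smallest precomputed prime capacity >= desired, found with binary search.
static int primeCapacityFor(int desired) {
    int i = Arrays.binarySearch(PRIME_CAPACITIES, desired);
    if (i < 0) i = -i - 1;                       // insertion point
    if (i == PRIME_CAPACITIES.length) {
        throw new IllegalArgumentException("desired capacity too large for this sketch");
    }
    return PRIME_CAPACITIES[i];
}
```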

Q: Or is this my misunderstanding, and only certain hash table variants require a prime table size?

Yes, only certain types require it; see above.

leventov
  • Thank you for your answer. First of all what does `when closest "available table sizes" differ in 2 times memory usage of hash table might be unreliable` mean? – Jim May 22 '15 at 07:54
  • Also to be honest I got lost a bit on if you actually recommend primes or not. Seems that it depends on the hash table variant **and** the quality of the hash function? – Jim May 22 '15 at 09:51
  • @Jim it means the same as "With power-of-2 table size, minimum resize factor is 2, and generally we cannot guarantee the hash table will have actual load factor any higher than 0.5." – leventov May 22 '15 at 11:44
  • `I recommend choosing prime size, with QHash or quadratic hash algorithm (aren't same), only when you need precise control over hash table load factor and predictably high actual loads` but for the quadratic version the load factor should be less than 50% to be effective – Jim May 25 '15 at 18:44
  • `Otherwise, I recommend to go with power-of-2 sized hash table with linear probing.` You mean you recommend it over external chaining too? – Jim May 25 '15 at 18:45
  • @Jim `for the quadratic version the load factor should be less than 50% to be effective` I don't understand what you mean. Any hash table is more effective with a 0.5 load factor than with a high load factor. You are probably confusing quadratic and linear probing. – leventov May 25 '15 at 18:49
  • @Jim [linear probing](http://en.wikipedia.org/wiki/Linear_probing) != external chaining; linear probing is a collision resolution technique for *open addressing* – leventov May 25 '15 at 18:51
  • I misunderstood your sentence. I think we agree that quadratic is most effective when the table is half empty, right? So I am not clear what you mean with `when you need precise control over hash table load factor and predictably high actual loads`. You mean that I can tolerate a half empty table? – Jim May 25 '15 at 18:52
  • `I think we agree that quadratic is most effective when table is half empty right` no, linear probing is better than quadratic, that is why I suggested it in the answer body. – leventov May 25 '15 at 18:53
  • `You mean that I can tolerate a half empty table?` If a 0.5 load factor is OK for you, choose **linear probing and power-of-2 table size**. If you need the load factor to be *guaranteed* above 0.5, you cannot choose a power-of-2 size, simply because you cannot guarantee that with a power-of-2 size. In this case I suggest choosing a prime size & quadratic. – leventov May 25 '15 at 18:57
  • For a high load factor and going for prime sizes, is there a load factor that it is usually best not to exceed? I assume that if we reach 0.95 or 1 we are at O(N) instead of O(1) due to the many searches/collisions – Jim May 25 '15 at 19:02
  • @Jim just choose as small a load factor as is acceptable. There is no certain "red" threshold. It depends on how much RAM / memory you allocate for the process, and the volume of the data. For example, if you have 5 GB of memory and 4 GB of data, you should choose 0.8. I can hardly imagine anybody needing more than 0.9 – leventov May 25 '15 at 19:21
  • To be honest I cannot find a reference stating that quadratic probing works well for high load factors. Example in Limitations: http://en.wikipedia.org/wiki/Quadratic_probing – Jim May 25 '15 at 20:20
  • Also strange enough I had a discussion with amit (in the comments and chat) and he could not find any reference about the primality needed for the quadratic using the book of Cormen as a reference – Jim May 25 '15 at 20:25
  • @Jim in Wikipedia, the passage about "degrading performance" means nothing different from the general hash table rule "higher load factor, less performance", which applies as well to double hashing, linear probing, external chaining, etc. Regarding "no guarantee" -- see [QHash](https://github.com/OpenHFT/Koloboke/wiki/QHash), which does have a guarantee of finding a free slot until the map becomes completely full, because the probing sequence is merely a slot permutation. – leventov May 25 '15 at 20:26
  • @Jim don't need -- don't have probing sequence guarantees. For QHash, size must be a prime of 4k + 3. See linked page. – leventov May 25 '15 at 20:28
  • I was wondering why you recommend using static tables with preloaded primes and binary search. I thought that finding a prime (by incrementing `n+=2`) is already `log(logN)`, so it is more optimal than binary search. Am I wrong here? – Jim May 26 '15 at 19:14
  • I don't clearly understand what you mean by "finding a prime by incrementing n+=2". Finding a number in a table of 3000 precomputed primes using binary search is ~10 comparisons and binary search steps. The prime might be about 1 mln, 1 bln. I don't understand how you are going to find a prime "close to 1 bln" faster. – leventov May 26 '15 at 19:21
  • Something like: `private int nextPrime(int n) { if (n % 2 == 0) n++; for (; !isPrime(n); n += 2); return n; }` But this takes N^(1/2) log N, so I wasn't right that it is faster. But is it too much overhead overall to need an extra array with the precomputed numbers? – Jim May 26 '15 at 19:27
  • Also one more thing that confused me. You say that quadratic probing is more memory efficient, i.e. works well at a high load factor, right? But at a high load factor we have more elements in the table, and with more elements in the table it is more likely that clusters (primary or secondary) will form, and that the clusters will be large. Additionally, quadratic probing suffers from secondary clustering. So how is quadratic probing efficient at high load factors as well? Do you mean efficient compared to linear probing? – Jim May 26 '15 at 19:29
  • @Jim this array is static (one per virtual machine / process). You also can pick up [the linked precomputed tables source code](https://github.com/OpenHFT/Koloboke/blob/344089c9fc7c2b53ba7d1299eb29214206e1ab1d/lib/impl/src/main/java/net/openhft/koloboke/collect/impl/hash/QHashCapacities.java#L288), this is tested, highly optimized source. – leventov May 26 '15 at 19:29
  • Yes, I mean "efficient compared to linear probing". – leventov May 26 '15 at 19:30
  • Maybe you can also check out http://stackoverflow.com/questions/30489556/can-we-implement-cuckoo-hashing-using-1-table if you have some free time, please? – Jim May 27 '15 at 17:44
  • @Jim I'm not an expert in Cuckoo algorithm – leventov May 27 '15 at 17:48
  • I recently read about it and it seems good but it is unknown. If you haven't worked with it, does it mean that it isn't that good? – Jim May 27 '15 at 20:08
  • @Jim at least I don't believe it could beat linear hashing. But I could be wrong. – leventov May 27 '15 at 21:00
  • @Jim for high loads, I could imagine some "complex" algorithm (like Cuckoo, Robin Hood, etc.) being better than quadratic, but I haven't seen any evidence yet. – leventov May 27 '15 at 21:02
  • I am sorry for my "trivial" question, but what constitutes evidence? Isn't a simple benchmark adequate? – Jim May 28 '15 at 07:29
  • Some people state that Robin Hood outperforms a simple linear hash, but my own benchmarks didn't prove that. In addition, Robin Hood requires storing the hash code along with keys/values in the table slots, which is more suitable for C/C++ than for Java, because Java's way is to cache the hash code *inside* the object (e.g. see java.lang.String). I.e. in Java it would lead to unnecessary "double-caching" – leventov May 31 '15 at 15:10
3

Java HashMap (java.util.HashMap) chains bucket collisions in a linked list (or [as of JDK8] tree depending on the size and overfilling of bins).

Consequently, theories about secondary probing functions don't apply. It seems the message 'use prime sizes for hash tables' has become detached over the years from the circumstances in which it applies...

Using powers of two has the advantage (as noted by other answers) that reducing the hash-value to a table entry can be achieved with a bit-mask. Integer division is relatively expensive, and in high-performance situations this can help.

I'm going to observe that "redistributing the collision chains when rehashing is a cinch for tables that are a power of two going to a power of two".

Notice that when using powers of two, rehashing to twice the size 'splits' each bucket between two buckets based on the 'next' bit of the hash-code. That is, if the hash-table had 256 buckets, and so used the lowest 8 bits of the hash-value, rehashing splits each collision chain based on the 9th bit: an entry either remains in the same bucket B (9th bit is 0) or goes to bucket B+256 (9th bit is 1). Such splitting can preserve/take advantage of the bucket handling approach. For example, java.util.HashMap keeps small buckets sorted in reverse order of insertion and then splits them into two sub-structures obeying that order. It keeps big buckets in a binary tree sorted by hash-code and similarly splits the tree to preserve that order.
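
A simplified sketch of that split for a plain chained table (hypothetical code, not the actual java.util.HashMap implementation, which also handles treeified bins and preserves node order rather than prepending):

```java
// Hypothetical chained-table node, just enough for the sketch.
class Node {
    final int hash;
    final Object key, value;
    Node next;
    Node(int hash, Object key, Object value, Node next) {
        this.hash = hash; this.key = key; this.value = value; this.next = next;
    }
}

// When doubling from oldCap to 2*oldCap, the chain in bucket b splits into
// a "lo" chain that stays at index b and a "hi" chain that moves to index
// b + oldCap, decided by the single bit (hash & oldCap).
void splitBucket(Node[] oldTab, Node[] newTab, int b, int oldCap) {
    Node lo = null, hi = null;
    for (Node e = oldTab[b]; e != null; e = e.next) {
        if ((e.hash & oldCap) == 0) {
            lo = new Node(e.hash, e.key, e.value, lo);   // stays at index b
        } else {
            hi = new Node(e.hash, e.key, e.value, hi);   // moves to b + oldCap
        }
    }
    newTab[b] = lo;
    newTab[b + oldCap] = hi;
}
```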

NB: These tricks weren't implemented until JDK8.

(I am pretty sure) java.util.HashMap only sizes up (never down). But halving a hash-table has similar efficiencies to doubling it.

One 'downside' of this strategy is that Object implementers aren't explicitly required to make sure the low-order bits of hash-codes are well distributed. A perfectly valid hash-code could be well distributed overall but poorly distributed in its low-order bits. So an object obeying the general contract for hashCode() might still tank when actually used in a HashMap! java.util.HashMap mitigates this by applying an additional hash 'spread' onto the provided hashCode() implementation. That 'spread' is really quick and crude (it XORs the 16 high bits into the low bits).
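
Based on the JDK 8 source linked below, that spread is essentially the following:

```java
// JDK 8-style 'spread': XOR the high 16 bits of hashCode() into the low
// 16 bits, so table indexes (taken from the low bits) see some influence
// from the high bits too.
static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}
```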

Object implementers should be aware (if not already) that bias in their hash-code (or lack thereof) can have a significant effect on the performance of data structures using hashes.

For the record I've based this analysis on this copy of the source:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/687fd7c7986d/src/share/classes/java/util/HashMap.java

Persixty
  • Very interesting analysis. But the JDK does resize the table by doubling. And I assume they chose external chaining because it is easy to implement. But I was wondering whether a prime-sized table cannot be used with the same bit "tricks" that you mention – Jim May 25 '15 at 18:49
  • Also about halving the hash table that you mention. I have never read any textbook even implying that technique. Is this actually used in any implementation? – Jim May 25 '15 at 18:50
  • @Jim Yes, it does resize by doubling. The 'trick' was to simplify doubling by realising that the collision buckets get neatly halved. Before JDK8 the code clearly doesn't use *that* trick. Halving is an often overlooked operation. However, for complex or long running applications it may be necessary. I notice `java.util.HashMap` doesn't bother. Ever noticed how servers benefit from a periodic reboot? Clog, such as classes not being unloaded and internal 'buffers' getting sized up but never down, is a reason why a system that never leaks may still be holding idle resources unnecessarily. – Persixty May 26 '15 at 06:56
  • I am not sure if that will solve your point. Even in C++, when you free memory it does not go back to the system for reuse; it is reserved to be reused by the same process in future requests. As a result, the memory of the process as a whole never goes down. This is most noticeable with applications like Firefox or Chrome that you have open for a long time: they end up consuming most of your system. So I am not sure if the effort of halving the table would save us from the periodic reboots that you correctly point out. Do I make sense? – Jim May 26 '15 at 07:35
  • @Jim. The C++ standard makes no such stipulation. This is getting off topic into a discussion about what happens under the hood of `new` and `delete`. With Visual Studio for example they call through to `malloc(.)` and `free()`. You will note here that `free(.)` may result in resources being handed back. https://msdn.microsoft.com/en-us/library/we1whae7.aspx. It would be a very poor platform/run-time that *never* handed resources back. See Windows 95/98 to prove the point! They didn't do this but were equally thought of as "Worst. Operating Systems. Ever." – Persixty May 26 '15 at 09:30
  • I take your point about browsers generally gobbling and gobbling and gobbling resources. But I would say they're rather selfish prima donna applications that in general can't be said to 'play well with others'. Many well-written long-running applications take great care to reallocate and hand back resources. I have no idea what the Chrome NFRs are, but if you can browse for 12 hours straight without a restart I suspect they're high-fiving. – Persixty May 26 '15 at 09:40
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/78778/discussion-between-dan-allen-and-jim). – Persixty May 26 '15 at 09:46
  • Check this http://www.cplusplus-soup.com/2010/01/freedelete-not-returning-memory-to-os.html – Jim May 26 '15 at 12:00
  • Also this question http://stackoverflow.com/questions/30458195/does-gc-release-back-memory-to-os#30458195 – Jim May 26 '15 at 12:29
  • @Jim please do consider responding in the chat. There are several points. Firstly, I can't link to the referred article under the 'soup' posting but even there it states 'Occasionally, free can actually return memory to the operating system and make the process smaller.' I'm absolutely saying `free(.)` tends to hold back a bit of memory but I'm also insisting that most implementations *eventually* release *some* memory back. Also that article is about a completely different platform to the one I referenced. It's certain the C++ Standard does not specify detailed behavior in this area. – Persixty May 26 '15 at 12:52
  • On the GNU platform have a look at http://www.gnu.org/software/libc/manual/html_node/Malloc-Tunable-Parameters.html#Malloc-Tunable-Parameters and `M_TRIM_THRESHOLD` in particular. It's appropriate for some (even many) applications to take and never give back. It certainly isn't appropriate in all circumstances and absolutely definitely is not a requirement of any C++ standard. I can sit here and watch this C++ program in System Monitor rise to 2Gb allocation (and slow the whole show down) and fall to ~40Kb and see the system come back to life. Surrendering resources is real! – Persixty May 26 '15 at 12:57
  • I am responding to the chat as well. I think C++ is relevant since JRE is written in C++. I opened this http://stackoverflow.com/questions/30458195/does-gc-release-back-memory-to-os Can you please have a look? – Jim May 26 '15 at 17:57