I need to implement a fixed-size hash map optimized for memory and speed, but am unclear as to what this means: does it mean that the number of buckets I can have in my hash map is fixed? Or that I cannot dynamically allocate memory and expand the size of my hash map to resolve collisions via linked lists? If the answer to the latter question is yes, the first collision resolution approach that comes to mind is linear probing. Can anyone comment on other, more memory- and speed-efficient methods, or point me towards any resources to get started? Thanks!
-
It probably means that the number of buckets is fixed. Each LinkedList (or similar data structure) that starts from a bucket can be of any size. Linear probing and periodic re-hashing are not bad ideas – TheLostMind Oct 26 '15 at 05:12
-
It could be interpreted in many different ways (some of which you have reasonably described). But it's not really something that can be answered by the SO community. Best to go back and clarify with the source of the requirements. – kaylum Oct 26 '15 at 05:45
-
Below two threads might help you. https://stackoverflow.com/questions/24203858/how-can-i-implement-a-fixed-size-hashmap https://stackoverflow.com/questions/29599047/why-is-hash-output-fixed-in-length – user1775015 Feb 20 '18 at 08:45
1 Answer
Without seeing specific requirements it's hard to interpret the meaning of "fixed-size hash map optimized for memory and speed," so I'll focus on the "optimized for memory and speed" aspect.
Memory
It's hard to give general advice on memory efficiency, particularly if the hash map really is a "fixed" size. In general, open addressing can be more memory efficient because the key and value can be stored on their own, without pointers to next and/or previous linked-list nodes. If your hash map is allowed to resize, you'll want to pick a collision resolution strategy that allows a higher load factor (elements/capacity) before resizing. A load factor of 1/2 is common in many hash map implementations, but it means at least 2x the necessary memory is always in use. The collision resolution strategy generally needs to balance speed against memory efficiency, tuned for your actual requirements/use case. A rough layout sketch follows below.
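To make the layout difference concrete, here is a minimal sketch (my own illustration, not taken from any particular library; the struct names are made up) of what a chained entry vs. an open-addressing slot might look like for int keys and values:

```cpp
// A minimal layout sketch (not a full hash map), assuming int keys/values,
// comparing per-entry storage for chaining vs. open addressing.
#include <cstdint>
#include <cstdio>
#include <vector>

// Chaining: each entry is a separately allocated node carrying a pointer
// to the next node in its bucket's list.
struct ChainNode {
    int key;
    int value;
    ChainNode* next;
};

// Open addressing: entries live directly in one flat array; a one-byte
// state flag marks empty vs. occupied slots instead of a pointer.
struct OpenSlot {
    int key;
    int value;
    std::uint8_t state;  // 0 = empty, 1 = occupied
};

int main() {
    std::printf("chained node: %zu bytes (plus per-node allocator overhead)\n",
                sizeof(ChainNode));
    std::printf("open-addressing slot: %zu bytes, stored contiguously\n",
                sizeof(OpenSlot));

    // A "fixed-size" open-addressed table can be a single up-front allocation.
    std::vector<OpenSlot> table(1024, OpenSlot{0, 0, 0});
    std::printf("table of %zu slots: %zu bytes total\n",
                table.size(), table.size() * sizeof(OpenSlot));
}
```

The point is only that every chained node pays for a pointer plus its own allocation, while open-addressing slots sit in one flat block.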
Speed
From a real-world perspective, particularly for smaller hash maps or those with trivially sized keys, the most important aspect of optimizing speed is likely reducing cache misses. That means putting as much of the information needed to perform operations into contiguous memory as possible.
My advice would be to use open addressing instead of chaining for collision resolution. This keeps more of your memory contiguous and should save at least one cache miss per key comparison. Open addressing will require some kind of probing, but compared to the cost of fetching each link of a linked list from memory, looping over a few adjacent array elements and comparing keys should be faster. See here for a benchmark of C++ std::vector vs std::list; the takeaway is that for most operations a normal contiguous array is faster due to spatial locality, despite the algorithmic complexity.
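As a concrete sketch of that probing loop (a hypothetical layout, assuming a power-of-two table size and no deletions; Slot and find_value are placeholder names, not a real API):

```cpp
// A minimal open-addressing lookup sketch using linear probing over a
// fixed-size, power-of-two table. Deletions/tombstones are ignored.
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

struct Slot {
    int key = 0;
    int value = 0;
    bool occupied = false;
};

std::optional<int> find_value(const std::vector<Slot>& table, int key) {
    const std::size_t mask = table.size() - 1;   // table.size() must be a power of two
    std::size_t i = std::hash<int>{}(key) & mask;
    for (std::size_t probes = 0; probes < table.size(); ++probes) {
        const Slot& s = table[i];
        if (!s.occupied) return std::nullopt;    // empty slot ends the probe chain
        if (s.key == key) return s.value;        // all comparisons touch one flat array
        i = (i + 1) & mask;                      // linear probe: step to the adjacent slot
    }
    return std::nullopt;                         // table full and key not present
}
```

Every probe in the loop touches the same flat array, so a few extra comparisons are usually cheaper than a single pointer chase into a heap-allocated list node.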
In terms of types of probing, linear probing has an issue of clustering. As collisions occur, the adjacent elements are consumed, which causes more and more collisions in the same section of the array; this becomes especially important when the table is nearly full. This can be mitigated with re-hashing, Robin Hood hashing (as you are probing to insert, if you reach an element closer to its ideal slot than the element being inserted, swap the two and try to insert that element instead; a much better description can be seen here, and a rough sketch follows below), etc. Quadratic probing doesn't have the same clustering problem as linear probing, but it has its own limitation: not every array location can be reached from every other array location, so depending on the size, typically only half the array can be filled before it would need to be resized.
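Here is roughly what that Robin Hood swap looks like during insertion, under the same assumptions as the lookup sketch above (power-of-two capacity, flat Slot array, no deletions, table never completely full, key not already present); this is an illustrative sketch, not the implementation of any particular library:

```cpp
// A rough Robin Hood insertion sketch: while probing, if the element
// sitting in a slot is closer to its ideal position than the element we
// are inserting, swap them and carry the displaced element onward.
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

struct Slot {
    int key = 0;
    int value = 0;
    bool occupied = false;
};

void robin_hood_insert(std::vector<Slot>& table, int key, int value) {
    const std::size_t mask = table.size() - 1;    // power-of-two capacity
    std::size_t i = std::hash<int>{}(key) & mask;
    std::size_t dist = 0;                         // our distance from the ideal slot so far
    while (true) {
        Slot& s = table[i];
        if (!s.occupied) {                        // empty slot: claim it
            s = Slot{key, value, true};
            return;
        }
        // How far is the resident element from its own ideal slot?
        std::size_t home = std::hash<int>{}(s.key) & mask;
        std::size_t their_dist = (i + table.size() - home) & mask;
        if (their_dist < dist) {                  // resident is "richer": take its slot
            std::swap(key, s.key);                // and continue inserting the displaced element
            std::swap(value, s.value);
            dist = their_dist;
        }
        i = (i + 1) & mask;
        ++dist;
    }
}
```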
The size of the array is also something that affects performance. The most common sizings are power-of-2 sizes and prime-number sizes. See Java: A "prime" number or a "power of two" as HashMap size? Arguments exist for both, but mostly performance will depend on usage. Specifically, power-of-two sizes are usually very bad with hashes that are sequential, but the conversion of a hash to an array index can be done with a single bitwise AND operation vs a comparatively expensive modulo.
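As a tiny illustration of that index-mapping difference (the capacities and hash value below are just example numbers):

```cpp
// Mapping a hash to an array index under the two sizing conventions.
#include <cstdint>
#include <cstdio>

int main() {
    std::uint64_t hash = 0x9E3779B97F4A7C15ull;           // an arbitrary example hash value

    std::uint64_t pow2_capacity  = 1024;                  // power-of-two table size
    std::uint64_t prime_capacity = 1031;                  // prime table size

    std::uint64_t idx_and = hash & (pow2_capacity - 1);   // one bitwise AND
    std::uint64_t idx_mod = hash % prime_capacity;        // integer modulo, typically slower

    std::printf("AND index: %llu, MOD index: %llu\n",
                (unsigned long long)idx_and, (unsigned long long)idx_mod);
}
```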
Notably, google_dense_hash is a very good hash map, easily outperforming the C++ standard library variation in almost every use case; it uses open addressing, a power-of-2 resizing convention, and quadratic probing. Malte Skarupke wrote an excellent hash table that beats google_dense_hash in many cases, including lookup. His implementation uses Robin Hood hashing and linear probing with a probe-length limit. It's very well described in a blog post, along with excellent benchmarking against other hash tables and a description of the performance gains.