0
  1. If the input data entries are around 10 raised to the power of 9, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?
  2. If we are using numbers in the range of 10 raised to the power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?

Kindly explain how these two things work. It's getting quite confusing. Thanks!!

I tried to make the table size around 75% of the input data size, which you can call X. Then I did key % X to get the hash code. But I am not sure if this is correct.

stoogie
  • Which hash table are you talking about? How does it resolve collisions? Does it use separate chaining or open addressing? Are you asking about some custom implementation, or standard library implementations? Do you know what a load factor is and how rehashing works? (BTW, I would recommend adding the C++ question tag. Your question will get more attention then.) – Daniel Langr Jan 31 '23 at 09:47
  • Hi, thanks for suggesting to add the c++ tag; I have done that. I am talking about implementing a hash table in C++, preferably using std::list (an array of lists). You might have guessed that I want to use separate chaining. I know the load factor is (n/m), where n = number of elements currently in the hash table and m = total size of the hash table. Not sure what rehashing is. – stoogie Jan 31 '23 at 11:47
  • 1
    Typical implementations don't use a fixed number of buckets (this is likely what you call a "size", which is misleading since size is usually used for the number of table elements/records). Instead, the bucket array can grow when the load factor exceeds some threshold. This is called rehashing. Anyway, as far as I understand, in your implementation, the number of buckets is fixed? – Daniel Langr Jan 31 '23 at 11:54
  • If I talk about the specific problem at hand, I would say no, it's not fixed. There's scope for increasing the size. – stoogie Jan 31 '23 at 11:57

2 Answers

0

If the input data entries are around 10 raised to the power of 9, do we keep the size of the hash table the same as the input size or reduce it? How do we decide the table size?

The ratio of the number of elements stored to the number of buckets in the hash table is known as the load factor. In a separate chaining implementation, I'd suggest doing what std::unordered_set et al. do and keeping it roughly in the range 0.5 to 1.0. So, for 10^9 elements have 10^9 to 2x10^9 buckets. Luckily, with separate chaining nothing awful happens if you go a bit outside this range: lower load factors just waste some memory on extra unused buckets, while higher load factors lead to more collisions, longer lists and longer search times. But at load factors under 5 or 10, with an ok hash function, the slowdown will be roughly linear on average (so 5 or 10x slower than at load factor 1).
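As a rough worked example of turning an expected element count and a target load factor into a bucket count (pick_bucket_count is just an illustrative name, not a standard function):

```cpp
#include <cstddef>

// Illustrative helper: choose a bucket count so that
// expected_elements / buckets stays at or below max_load_factor.
// e.g. 10^9 elements at load factor 1.0 -> ~10^9 buckets,
//      at load factor 0.5 -> ~2x10^9 buckets.
std::size_t pick_bucket_count(std::size_t expected_elements,
                              double max_load_factor = 1.0)
{
    return static_cast<std::size_t>(expected_elements / max_load_factor) + 1;
}
```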

One important decision you should make is whether to pick a bucket count around this magnitude that is a power of two or a prime number. Explaining the implications is tedious, and anyway, which works best for you is best determined by trying both and measuring the performance (that's if you really have to care about smallish differences in performance; if not, a prime number is the safer bet).

If we are using numbers in the range of 10 raised to the power of 6 as the key, how do we hash these numbers to smaller values? I know we use the modulo operator, but modulo with what?

Are these keys unsigned integers? In general, you can't have only 10^6 potential keys and end up with 10^9 hash table entries, as hash tables don't normally store duplicates (std::unordered_multiset/multi_map can, but it'll be easier for you to model that kind of thing as a hash table from distinct keys to a container of values). More generally, it's best to separate the act of hashing (which is usually expected to generate a size_t result) from the "folding" of the hash value over the number of buckets in the hash table. That folding can be done using % in the general case, or by bitwise-ANDing with a bitmask for power-of-two bucket counts (e.g. for 256 buckets, & 255 is the same as % 256, but may execute faster on the CPU when those 255/256 values aren't known at compile time).
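To make that separation concrete, here's a small sketch, with made-up function names, of hashing a key and then folding the result over the bucket count both ways:

```cpp
#include <cstddef>
#include <functional>

// Step 1: hash the key to a size_t (std::hash for integers is typically
// an identity hash in mainstream Standard Library implementations).
std::size_t hash_key(unsigned key)
{
    return std::hash<unsigned>{}(key);
}

// Step 2a: fold the hash over an arbitrary (e.g. prime) bucket count.
std::size_t bucket_for_general(std::size_t hash, std::size_t bucket_count)
{
    return hash % bucket_count;
}

// Step 2b: fold over a power-of-two bucket count with a bitmask.
// For bucket_count == 256, (hash & 255) == (hash % 256).
std::size_t bucket_for_pow2(std::size_t hash, std::size_t bucket_count)
{
    return hash & (bucket_count - 1);  // requires bucket_count to be a power of two
}
```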

I tried to make the table size around 75% of the input data size, which you can call X.

So that's a load factor around 1.33 (i.e. 1 / 0.75), which is ok.

Then I did key % X to get the hash code. But I am not sure if this is correct.

It ends up being the same thing, but I'd suggest thinking of that as having a hash function hash(key) = key, followed by mod-ing into the bucket count. Such a hash function is known as an identity hash function, and is the implementation used for integers by all major C++ compiler Standard Libraries, though no particular hash functions are specified in the C++ Standard. It tends to work ok, but if your integer keys are particularly prone to collisions (for example, if they were all distinct multiples of 16 and your bucket count was a power of two they'd tend to only map to every 16th bucket) then it'd be better to use a stronger hash function. There are other questions about that - e.g. What integer hash function are good that accepts an integer hash key?
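To illustrate, here's a sketch contrasting the identity hash with one possible stronger mixer (a splitmix64-style finalizer, shown as just one example of a stronger integer hash, not what any particular Standard Library uses):

```cpp
#include <cstdint>

// Identity hash: effectively what major Standard Library implementations
// do for integer keys.
std::uint64_t identity_hash(std::uint64_t key)
{
    return key;
}

// A stronger mixing hash (splitmix64 finalizer): spreads out keys that
// share low bits, e.g. keys that are all multiples of 16.
std::uint64_t mix_hash(std::uint64_t key)
{
    key ^= key >> 30;
    key *= 0xbf58476d1ce4e5b9ULL;
    key ^= key >> 27;
    key *= 0x94d049bb133111ebULL;
    key ^= key >> 31;
    return key;
}
```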

Rehashing

If the number of elements may increase dramatically beyond your initial expectations at run-time, then you'll want to increase the number of buckets to keep the load factor reasonable (in the range discussed above). Support for that is easy to implement: first write a hash table class that doesn't support rehashing, simply taking the number of buckets to use as a constructor argument. Then write an outer, rehashing-capable hash table class with a data member of the above type. When an insert would push the load factor too high (Standard Library containers have a max_load_factor member which defaults to 1.0), construct an additional inner hash table object, telling its constructor a new, larger bucket count to use; then iterate over the smaller hash table, inserting (or - better - moving, see below) the elements into the new one; then swap the two hash tables so the data member ends up with the new, larger content and the smaller one is destructed. By "moving" above, I mean simply relinking the linked-list elements from the smaller hash table into the lists in the larger one instead of deep copying them, which will be dramatically faster and momentarily use less memory while rehashing.
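A very rough sketch of that two-layer design, simplified to a set of ints with made-up class names (FixedHashSet, HashSet), no find/erase/iterators, and no attempt at error handling:

```cpp
#include <cstddef>
#include <list>
#include <utility>
#include <vector>

// Inner table: fixed bucket count, no rehashing.
class FixedHashSet {
public:
    explicit FixedHashSet(std::size_t bucket_count) : buckets_(bucket_count) {}

    bool insert(int key) {
        auto& bucket = bucket_for(key);
        for (int existing : bucket)
            if (existing == key) return false;      // no duplicates
        bucket.push_back(key);
        ++size_;
        return true;
    }

    // Relink an existing list node from another table's bucket (no copy).
    void splice_in(std::list<int>& from, std::list<int>::iterator it) {
        auto& bucket = bucket_for(*it);
        bucket.splice(bucket.end(), from, it);
        ++size_;
    }

    std::size_t size() const { return size_; }
    std::size_t bucket_count() const { return buckets_.size(); }
    std::vector<std::list<int>>& buckets() { return buckets_; }

private:
    std::list<int>& bucket_for(int key) {
        // Identity hash folded over the bucket count, as discussed above.
        return buckets_[static_cast<std::size_t>(key) % buckets_.size()];
    }

    std::vector<std::list<int>> buckets_;
    std::size_t size_ = 0;
};

// Outer table: grows the inner table when the load factor gets too high.
class HashSet {
public:
    explicit HashSet(std::size_t initial_buckets = 8) : table_(initial_buckets) {}

    void insert(int key) {
        if (table_.size() + 1 > max_load_factor_ * table_.bucket_count())
            rehash(table_.bucket_count() * 2);
        table_.insert(key);
    }

private:
    void rehash(std::size_t new_bucket_count) {
        FixedHashSet bigger(new_bucket_count);
        for (auto& bucket : table_.buckets())       // move, don't deep copy
            while (!bucket.empty())
                bigger.splice_in(bucket, bucket.begin());
        std::swap(table_, bigger);                  // old, now-empty table dies here
    }

    FixedHashSet table_;
    double max_load_factor_ = 1.0;
};
```

The key detail is the splice_in call: std::list::splice relinks the existing node into the new bucket rather than allocating and copying a new one.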

Tony Delroy
0

I don't know if this is a proper way to answer Qs on here. I wanted to post this in a comment but I lack the rep.

But, you know, how about letting the profiler and your competition tell you how good or bad your solution is? If your top hotspot is showing a boatload of cache misses, the profiler has basically told you how to improve it and make a second draft. If you're competing against a search engine, see how fast it can look up results and then measure yours in comparison. If yours is worse, then maybe you need to research more new techniques or look more closely at your hotspots. If it's better, and especially far better, you might have come across a breakthrough in your industry.

For load factors/adapting the table size though, I don't even think 1 is that large if you only require 32 bits for an empty node. If you're going to be wasteful and use 64 bits per node for the convenience of pointers on 64-bit architectures, that's quite a bit more wasteful and you'll typically double your cache misses if you're only storing small keys/values, like just a 32-bit integer in a hash set.

Prime number table sizes are a huge win in all my measurements. I come from image processing, so I came in with the bias that integer div/modulo is terribly expensive [we often use power-of-two sizes in image processing to avoid repeated division and modulo, replacing them with bit shifts and bitwise AND to access pixel rows of images], but that was totally wrong, and my profiler taught me how wrong I was faster than anyone else could have.

For hash algorithms, I've tried all sorts like DJB2, mmhash, CityHash, ones that use 64-bit hashes and SIMD, etc. I never have cryptographic needs; I just want to make the most efficient associative containers. DJB2 was one of the best I've found for my use cases and is also ridiculously simple to implement, but in my experience table size usually matters far more than minutely-better hashing algorithms, for both linear probing and separate chaining and all the other variants.
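For reference, a sketch of the classic DJB2 string hash (start at 5381, then hash = hash * 33 + c per byte), which is about as simple as these get:

```cpp
#include <cstddef>

// Classic DJB2 string hash (Dan Bernstein): start at 5381 and
// apply hash = hash * 33 + c for each byte of the string.
std::size_t djb2(const char* str)
{
    std::size_t hash = 5381;
    while (unsigned char c = static_cast<unsigned char>(*str++))
        hash = hash * 33 + c;   // often written ((hash << 5) + hash) + c
    return hash;
}
```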

One thing that seems obvious to me, but I'm going to point it out just in case, is that if your hash table compares keys a lot, or first compares hashes for equality before keys, you will probably benefit from storing keys, hashes, and/or values in separate memory blocks (ex: parallel arrays/SoA rep). It's 101 hot/cold field splitting, but I've disappointingly found many implementations disregarding it and just storing keys, values, and possibly the hash all interleaved in an AoS rep, loading mostly-irrelevant data into cache lines left and right.
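A sketch of that layout difference, with purely illustrative names:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// AoS ("array of structs"): probing touches hashes, keys, and values
// together, dragging cold value data into cache lines during lookups.
struct EntryAoS {
    std::uint64_t hash;
    std::string   key;
    std::string   value;
};
struct TableAoS {
    std::vector<EntryAoS> entries;
};

// SoA ("struct of arrays" / parallel arrays): probing scans the compact
// hashes array first and only touches keys/values on a probable match.
struct TableSoA {
    std::vector<std::uint64_t> hashes;   // hot: scanned on every probe
    std::vector<std::string>   keys;     // warm: compared when hashes match
    std::vector<std::string>   values;   // cold: touched only on a hit
};
```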

Also, for people who think separate chaining is slower than linear probing: they're probably thinking with a very serial mindset, and of the crude implementations that require 64-bit pointers on 64-bit architectures, with buckets that allocate one node at a time from a general-purpose, variable-length allocator. Separate chaining is way underrated, especially if your hash table handles parallel requests for insertions/deletions/searches. It's clear even from the name, "probing": that's going to want to read and write the same shared data across threads. You will generally end up with way more false sharing with a probing solution in those cases, even when carefully using techniques like TTAS, even if it might perform best in single-threaded use cases.

Computer science courses are getting really backwards these days; we have 4+ core machines even in full mini PCs costing $300, yet they teach as if we still only have one physical thread. I implemented a table the other day using DJB2 for the hash and prime table sizes that can insert 1024*1024 elements into a map of int32_t keys and values in under 6ms (faster than std::vector<int> in C++, which requires 15ms to insert 1024*1024 int32_ts on my i7-2600K, in spite of the fact that the map is storing twice as much data, both keys and values, and doing a bucket search with each insertion to avoid inserting duplicates).

So I hope that wasn't an entirely useless answer, but I was going with the vibes. I think you lack confidence, and measuring more (both with tools like profilers and against whatever you're competing with) will let you know where your solutions stand in the industry.