3

If my intention is only to have a good hash function that spreads data evenly into all of the buckets, then I need not come up with a family of hash functions, I could just do with one good hash function, is that correct?

The purpose of having a family of hash functions is only to make it harder for the enemy to build a pathological data set as when we pick a hash function randomly, he/she has no information about which hash function is employed. Is my understanding right?

EDIT: Since someone is trying to close as unclear; This question is to know the real purpose of employing a Universal family of hash functions.

Aravind
  • 3,169
  • 3
  • 23
  • 37

1 Answers1

0

I could just do with one good hash function, is that correct?

As you note later in your question, an "enemy" who knows which hash function you're using could prepare a pathological data set.

Further, hashing is just the first stage in storing data into your table's buckets - if you're implementing open addressing / closed hashing, you also need to select alternative buckets to probe after collisions: simple approaches like linear and quadratic probing generally provide adequate collision avoidance, and are likely mathematically simpler and therefore faster than rehashing, but they don't maintain a probability of the next probe finding an unused bucket at the load factor. Rehashing with another good hash function (including another from a family of such functions) does, so if that's important to you you may prefer to use a family of hash functions.

Note too that sometimes an in-memory hash table is used to say at which offsets/sectors on disk data is stored, so extra rehashing calculations with already-in-memory data may be far more appealing than a higher probability (with linear/quadratic probing) of waiting on disk I/O only to find another collision.

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
  • I understand what you are saying, but I think we are confusing two things here. One is double hashing to find the probe sequence, and the other is the Universal family of hash functions. From my understanding these are 2 different things. To make things simple, let's employ chaining as the means of collision resolution, now, if I am not worried about enemy attacks, I need not bother implementing an Universal family of hash functions, is that right? I could just pick one good hash function and use it for my implementation, is that right? – Aravind Feb 08 '16 at 04:33
  • @Aravind: you can indeed do *double hashing* without a universal family of hash functions: to quote [Wikipedia](https://en.wikipedia.org/wiki/Double_hashing) *"...hash functions h1 and h2, the ith location in the bucket sequence for value k in a hash table T is: h(i,k)=(h1(k) + i * h2(k)) mod |T|."* - only using two hash functions. Never-the-less, Wikipedia continues *"Generally, h_1 and h_2 are selected from a set of universal hash functions."*. It's also possible to use different hash functions from the family for each successive probe, rather than *i * h2(k)* typical of double hashing. – Tony Delroy Feb 08 '16 at 06:02
  • Anyway, my point was indeed that such families of hash functions can give different collision proneness, and you *may* care even in situations where you don't have an "enemy" trying to cause collisions. If you have neither special needs for predictable post-collision open addressing and no enemy, I can't think of another reason to jump at a universal family of hash functions. Needing them in everyday software that's not 'net-attack robust is the exception, rather than the rule. – Tony Delroy Feb 08 '16 at 06:05
  • Okay! I have one question though: "It's also possible to use different hash functions from the family for each successive probe, rather than i * h2(k) typical of double hashing.". Different hash functions for each successive probe? I thought the hash functions are picked at random at the start of the usage of the hash table. Where can I read more about this approach of picking random hash functions for each probe? How is the function used for a specific probe stored to lookup keys later? – Aravind Feb 08 '16 at 06:40
  • @Aravind: *" I thought the hash functions are picked at random at the start of the usage of the hash table."* - they can be, to avoid malicious attacks, or e.g. to randomise the hash function and hence order of iteration, so code that incorrectly assumes something about that order is more likely to fail early (hopefully in testing). That said, you can use families of hash functions even when *not* selecting any of them at random: you might start with `h(key)`, if collides, try `h'(key)`, then `h''(key)` etc. - where h/h'/h'' are successive family members, until you find an unused bucket. – Tony Delroy Feb 08 '16 at 07:00
  • Thanks, there are so many ways to use the family of hash functions. Although I am not very mathematically inclined to check which way is better, nevertheless, my question was answered. – Aravind Feb 08 '16 at 10:43
  • @Aravind: which way is better can not always be answered mathematically either - sometimes it's a function of the sizes of your main and different levels of cache memory, CPU speed etc. - for reliably meaningful results, nothing beats pumping your actual data into different hash table implementations in the host and application context your end-users will have, and making careful timing measurements. Anyway - good luck with your coding. – Tony Delroy Feb 09 '16 at 06:16
  • I have been messing with the math part of it and having a horrible time of it. The most sane advice I read about hash table implementations are just what you say: just benchmark and see how to tune the implementation. I am preparing for a technical interview and want to get my basics clear. So, I would not be coding up a hash table just right now, although it's on my bucket list. Thanks for your patient inputs! – Aravind Feb 09 '16 at 10:52