0

For example, if writing a dictionary class, collisions are rare, but they do exist. As a result, you need to store the key to make sure that when you find your key in the hashtable, it is the right one and not a collision.

Sometimes the keys are long and they're usually strings, so each key can be over 40 bytes, compared to if it was just a hash code. What if the key stored was the object hashed but using a slightly different hashing algorithm, with different prime numbers? Then the chances of a collision would be (1/(2^32)) * (1/(2^32)).

You can even have another hashing algorithm and store that hash instead, so the chances of a collision would be (1/(2^32)) * (1/(2^32)) * (1/(2^32)). Obviously a collision could STILL happen, but the chances are so low, and you save so much memory by only having to store a 4 bytes for a key instead of over 32 bytes.

I guess this is still not acceptable though, right, because there's still a CHANCE, but there's also a chance someone's RAM could accidentally flip a bit and bluescreen, and this just seems so unlikely it's tempting not to implement. Are there any alternatives or is the small chance still not worth it?

John Smith
  • 8,567
  • 13
  • 51
  • 74
  • The point of a key in a `dictionary` is that you have a known unique indexer and you want to find the value stored at that index. If you have some off-the-wall hash that you can only get from the object itself, you probably don't need a dictionary, just a list. – ps2goat Mar 03 '15 at 19:55
  • 2
    If your hash is 32 bits, the probability of collision will be quite higher than 1/2^32. And you will need 4 bytes for each such hash function, so 12 bytes for 3 of them. It's not at all unlikely to happen with such small hashes, so if you must absolutely avoid collisions, I suggest you spare the 40 bytes. – IVlad Mar 03 '15 at 19:56
  • You're also assuming a perfect hash function with a perfectly distributed probability of every value, which is basically never the case in reality. – Servy Mar 03 '15 at 19:57
  • Even if you had store 6 hashes, each calculated with different prime numbers, it's still less memory used than a string key hash, and the probabilities of a collision would be close to 1 in 10^50. I know there's still a chance and hashing isn't perfect but it seems tempting.. – John Smith Mar 03 '15 at 20:00
  • No, it wouldn't be in 1 in 10^50. Where are you getting these numbers? It would be a lot more likely than that in practice, depending on the actual algorithms used for the hashes. – IVlad Mar 03 '15 at 20:03
  • This is sort of how a HashSet works. – JNYRanger Mar 03 '15 at 20:04
  • @IVlad: I'm getting the number from a hash being a 32-bit answer and with six different hashing algorithms used on the same key, and each hash stored, is around `(1/(2^32)) * (1/(2^32)) * (1/(2^32)) * (1/(2^32)) * (1/(2^32)) * (1/(2^32))`, or `1.59 × 10^-58` Obviously hashing isn't perfectly uniform so I took out many orders of magnitude to compensate. – John Smith Mar 03 '15 at 20:04

2 Answers2

2

If you want to make 100% sure there arent any collison there is no way arround checking for the key before inserting. That being said, we're lucky here because a well implemented dictionnary is exactly what you need in order to quickly find a key.

That being said, you might want to take a look at the function described here. The collision chances will be rather low

EDIT: Removed some non-sense I wrote about GUIDs...

Community
  • 1
  • 1
  • Agreed: a GUID is a much better idea than manually combining who knows what kind of hash functions. – IVlad Mar 03 '15 at 20:07
  • How would you know the key pointed to the GUID though? There's no association. – John Smith Mar 03 '15 at 20:15
  • You are right, this portion of my answer was silly beyond repair. I maintain your keys are stored *de facto* and are easy to check. Let me edit this... – Renaud Gauthier Mar 03 '15 at 20:28
1

It depends.

Do you absolutely need to guarantee collision resolution? If so: you have to store the key, or something equivalent to it. In some cases (e.g. small keyspace, redundant data, etc.) you can use compression or custom hash functions to reversibly map the key to something smaller.

If not: yes, your approach would work. Note that, due to the birthday paradox, the probability of collision is:

  • dependent on the number of elements already in the collection; and
  • higher than you think.

There's a tradeoff: now you have to compute (and compare) several hashes to find items.

Following further down this path: why have a fixed number of hashes? You could compute one hash, and only compute the next if there's a collision; this leads to a trie-based implementation. (Of course, you then need a reliably-distributed family of hash functions...)

Most of this is overkill for all but the most high-performance and/or memory-constrained applications - but it is occasionally useful to do things like this :)

candu
  • 2,827
  • 23
  • 18