0

I'm currently experimenting with different hashtable implementations (in C if it matters), and one of the implementations that I'm examining uses open-addressing, linear probing, and no tombstones.

I am using an algorithm for removals from the following stack overflow answer.

https://stackoverflow.com/a/24886657/7623056

In this algorithm, we need to check if a slot is unoccupied. This makes me think that we need additional state to determine whether slots are occupied or unoccupied. For example, we simply cannot check if all bits in the slot are zero, because it is possible to insert a valid key, value where all bits just happen to be zero, even if this is extremely unlikely.

Are there any strategies that could eliminate this additional state?

One possibility is to force the user to supply a key that will never be used, which will then be stored in all empty locations. However, this is undesirable in situations where the user cannot predict which keys could remain unused.

Another possibility that I can think of would be to store the hash with the key and develop a hash function that is guaranteed to never produce one specific value. This hash value will then be stored in all empty slots. For example, we could select the hash value 0. Then, in the real hash function, any time 0 is produced, we could just return 1. However this seems like an undesirable solution because this will remove the uniform distribution property of our hash function.

Edit: I've been thinking about this more, and I think that the second solution might be the best. Since we mod the hash value by the number of buckets, the uniform distribution property isn't as important for hashtables as it is for cryptographic applications such as an hMAC. So maybe a better question is: will this relaxation of the hash function lead to any undesirable consequences that I haven't thought of? For example, would this make it easier for an attacker to reverse engineer the hash function and perform a DOS attack? Could this be mitigated with robin-hood hashing, since it minimizes the cluster size?

  • It seems to me that your two proposed solutions are in fact effectively identical. – 500 - Internal Server Error Dec 23 '22 at 20:58
  • @500-InternalServerError I actually disagree. The first solution does not limit the allowed bounds of the result of the hash function, while the second solution does. However, if the hash function is bijective (which isn't guaranteed), then you are correct, and they would be effectively the same. – labmonkey398 Dec 23 '22 at 21:14

0 Answers0