This question is language agnostic. But let's use C and some pseudo code for demonstration.
I want to store a hash-map / hash-table / dictionary / key-value store inside a static array (or optionally multiple static arrays), i.e. I have some upper limit about the number of entries, and have this data structure:
Key keys[N];
Value values[N];
In my case, I have typedef int Key
maybe answers to this can be more generic.
So the implicit dictionary this would represent is (Python syntax):
{keys[i]: values[i] for i in range(N)}
I want that accessing the dict is fast (average O(1)) (for both keys inside the dict, and also keys not inside the dict), and optimally also writing into it, although I will read from it a lot more. Also, in my use case, I would know in advance the dict I want to write. So let's say the D
is the known dict. I would choose N = len(D)
(numbers of entries in D
), although maybe that's already not optimal.
A naive way would be to use some hash function, hash : Key -> uint
, e.g. this (for Key == int
), and then do this (pseudo code):
def init_by_dict(D):
keys = unset # all keys are unset initially
max_hops = 0
for key, value in D:
i = hash(key) % N
cur_max_hops = 0
while keys[i] is set:
i = (i + 1) % N
cur_max_hops += 1
max_hops = max(max_hops, cur_max_hops)
keys[i] = key
values[i] = value
Then a query would look like:
def query(key):
i = hash(key) % N
cur_max_hops = 0
while keys[i] is set and cur_max_hops <= max_hops:
if keys[i] == key:
return values[i]
i = (i + 1) % N
cur_max_hops += 1
return unset
So in worst case it would be O(max_hops). I.e. you would want to keep max_hops low. The naive way for init_by_dict
might already have this property if the hash function is good, but I don't know.
That's basically my question: What are efficient implementations of init_by_dict
and query
for the given data structure?
I also found this blog post which discusses also a very similar setting. And maybe also this question. But I would like to know if there are some well-known theoretical / state-of-the-art approaches on this problem.
I also read about open addressing which seems similar to what I want. Also this related question. And the FlatMap
TensorFlow GTL class. I have to read more about this. Maybe someone can also answer.