
This question is language-agnostic, but let's use C and some pseudocode for demonstration.

I want to store a hash-map / hash-table / dictionary / key-value store inside a static array (or optionally multiple static arrays), i.e. I have some upper limit on the number of entries, and this data structure:

Key keys[N];
Value values[N];

In my case, I have typedef int Key, but maybe answers to this can be more generic. The implicit dictionary this would represent is (Python syntax):

{keys[i]: values[i] for i in range(N)}

I want access to the dict to be fast (average O(1)), both for keys inside the dict and for keys not inside it, and optimally also writing into it, although I will read from it a lot more than I write. Also, in my use case, I would know in advance the dict I want to write. So let's say D is the known dict. I would choose N = len(D) (the number of entries in D), although maybe that's already not optimal.
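
For concreteness in C, there also needs to be some representation of an unused slot. Here is a minimal sketch of the layout, assuming a sentinel key value EMPTY_KEY that never occurs as a real key (a separate occupied[] bit array would work just as well):

#include <limits.h>

#define N 1024              /* assumed capacity, >= number of entries in D */
#define EMPTY_KEY INT_MIN   /* assumed sentinel meaning "slot unused" */

typedef int Key;
typedef int Value;          /* placeholder value type for this sketch */

static Key   keys[N];       /* must be filled with EMPTY_KEY before use */
static Value values[N];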

A naive way would be to use some hash function hash : Key -> uint, e.g. this one (for Key == int), and then do this (pseudocode):

def init_by_dict(D):
    keys = unset  # mark all N slots as empty initially
    max_hops = 0
    for key, value in D.items():
        i = hash(key) % N
        cur_hops = 0
        while keys[i] is set:  # linear probing: slot occupied, step to the next one
            i = (i + 1) % N
            cur_hops += 1
        max_hops = max(max_hops, cur_hops)  # remember the longest probe sequence
        keys[i] = key
        values[i] = value
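
For reference, a direct C translation of this naive insertion might look roughly like the sketch below (the hash function is just an assumed toy integer mixer, and EMPTY_KEY is the assumed sentinel from above; init_by_dict(D) would then simply call insert() for every pair in D):

static int max_hops = 0;             /* longest probe sequence seen so far */

/* assumed toy integer mixer; any reasonable hash for int keys would do */
static unsigned int hash(Key k) {
    unsigned int x = (unsigned int)k;
    x ^= x >> 16;
    x *= 0x45d9f3bu;
    x ^= x >> 16;
    return x;
}

/* insert one (key, value) pair with linear probing;
   assumes the table still has at least one free slot */
static void insert(Key key, Value value) {
    unsigned int i = hash(key) % N;
    int hops = 0;
    while (keys[i] != EMPTY_KEY) {   /* slot occupied: step to the next one */
        i = (i + 1) % N;
        ++hops;
    }
    if (hops > max_hops)
        max_hops = hops;
    keys[i] = key;
    values[i] = value;
}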

Then a query would look like:

def query(key):
    i = hash(key) % N
    cur_hops = 0
    while keys[i] is set and cur_hops <= max_hops:
        if keys[i] == key:
            return values[i]
        i = (i + 1) % N
        cur_hops += 1
    return unset  # hit an empty slot or exceeded max_hops: key not present
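
A corresponding C sketch of the lookup, under the same assumptions (sentinel EMPTY_KEY, the global max_hops recorded during insertion):

/* returns a pointer to the stored value, or NULL if the key is absent */
static Value *query(Key key) {
    unsigned int i = hash(key) % N;
    int hops = 0;
    while (keys[i] != EMPTY_KEY && hops <= max_hops) {
        if (keys[i] == key)
            return &values[i];
        i = (i + 1) % N;             /* keep following the probe sequence */
        ++hops;
    }
    return NULL;                     /* empty slot or too many hops: not present */
}

Returning a pointer instead of a Value is just one way to signal "not present" in C; the unset return value in the pseudocode above plays the same role.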

So in the worst case a query is O(max_hops), i.e. you want to keep max_hops low. The naive init_by_dict might already have this property if the hash function is good, but I don't know.

That's basically my question: What are efficient implementations of init_by_dict and query for the given data structure?

I also found this blog post, which discusses a very similar setting, and maybe also this question. But I would like to know whether there are well-known theoretical / state-of-the-art approaches to this problem. I also read about open addressing, which seems similar to what I want. There is also this related question, and the FlatMap class in TensorFlow's GTL. I still have to read more about these. Maybe someone can answer in the meantime.

  • If the set of keys is immutable, you could use a [minimal perfect hash function](https://github.com/thomasmueller/minperf), but maybe that's overly complicated for your case. It depends on the performance and size you are looking for (how many empty entries are OK? how fast does it need to be? how many entries do you expect?) – Thomas Mueller Mar 28 '18 at 07:36
  • @ThomasMueller: In some use cases, I need to add entries afterwards, so then the set of keys is not immutable. I want to store maybe 1k-10k entries, and ideally I want only maybe 10-20% empty entries. – Albert Mar 28 '18 at 09:18
  • Do you add all entries in one step? In that case you could still use a MPHF (minimal perfect hash function). But if you expect inserts to be O(1), then maybe you should look at [hopscotch hashing](https://en.wikipedia.org/wiki/Hopscotch_hashing), which allows for high load factors (>0.9). Or some form of cuckoo hashing (but getting such a high load factor is a bit harder there). – Thomas Mueller Mar 28 '18 at 09:48
  • @ThomasMueller: In some situations, I will add all entries in one step, but in other cases incrementally. Hopscotch hashing looks interesting; I will look into it. – Albert Mar 28 '18 at 17:58

0 Answers