5

I have an efficiency critical application, where I need such an array-type data structure A. Its keys are 0, 1, 2,..., and its values are uint64_t distinct values. I need two constant operations:

1. Given i, return A[i];
2. Given val, return i such that A[i] == val

I prefer not to use hash table. Because I tried GLib GHashTable, it took around 20 mins to load 60 million values into the hash table (If I remove the insertion statement, it took only around 6 seconds). The time is not acceptable for my application. Or maybe somebody recommend other hash table libraries? I tried uthash.c, it crashed immediately.

I also tried SDArray, but it seems not the right one.

Does anybody know any data structure that would fulfill my requirements? Or any efficient hash table implementations? I prefer using C/C++.

Thanks.

Joy
  • 9,430
  • 11
  • 44
  • 95
  • 3
    Try `std::unordered_map` if you have C++11 available. In general, you need *two* hash tables, one for key-value lookup and one for value-key lookup. Of course, you have to insert new entries in both tables. – leemes Apr 03 '13 at 09:25
  • 1
    @leemes: Write an answer! – Lightness Races in Orbit Apr 03 '13 at 09:26
  • Make sure you reserve enough room in the hash table before you start adding entries. Eg, call `my_unordered_map.reserve(6e7)` (6e7 == 60 million) if you are using `std::unordered_map` – Nicu Stiurca Apr 03 '13 at 09:27
  • @Zeta I would prefer C, but if cannot, I can accept C++. I am in the middle of moving from C to C++. – Joy Apr 03 '13 at 09:28
  • Those 20 minutes, did they include the time spent reading data from the disk or did they only include hash table operations? – Alexey Frunze Apr 03 '13 at 09:36
  • 1
    @Alexey_Frunze Just the hash table insertion. If I remove the insertion, the loading takes around 6 secs only. My key is uint64_t, and the value is pointer. – Joy Apr 03 '13 at 09:38
  • It may be useful to have this info in the question. – Alexey Frunze Apr 03 '13 at 09:43

2 Answers2

6

In general, you need two hash tables for this task. As you know, hash tables give you a key look-up in expected constant time. Searching for a value requires iterating through the whole data structure, since information about the values isn't encoded in the hash look-up table.

Use two hash tables: One for key-value and one (reversed) for value-key look-up. In your particular case, the forward search can be done using a vector as long as your keys are "sequential". But this doesn't change the requirement for a data structure enabling fast reverse look-up.

Regarding the hash table implementation: In C++11, you have the new standard container std::unordererd_map available.

An implementation might look like this (of course this is tweakable, like introducing const-correctness, calling by reference etc.):

std::unordered_map<K,T> kvMap; // hash table for forward search
std::unordered_map<T,K> vkMap; // hash table for backward search

void insert(std::pair<K,T> item) {
    kvMap.insert(item);
    vkMap.insert(std::make_pair(item.second, item.first));
}

// expected O(1)
T valueForKey(K key) {
    return kvMap[key];
}

// expected O(1)
K keyForValue(T value) {
    return vkMap[value];
}

A clean C++11 implementation should "wrap" around the key-value hash map, so you have the "standard" interface in your wrapper class. Always keep the reverse map in sync with your forward map.

Regarding the creation performance: In most implementations, there is a way to tell the data structure how much elements are going to be inserted, called "reserve". For hash tables, this is a huge performance benefit, as dynamically resizing the data structure (which happens during insertions every now and then) completely re-structures the whole hash table, as it changes the hash function itself.

leemes
  • 44,967
  • 21
  • 135
  • 183
  • In general, I'd agree. But in this specific case, wouldn't it be better to us an array (or `std::vector`) for the forward search? That should use significantly less RAM and should be faster. – user9876 Apr 03 '13 at 09:38
  • You are right, I didn't see that the keys are * sequential*. Of course, in this case you should use a container which is optimized for such cases, which is the vector. – leemes Apr 03 '13 at 09:39
  • 1
    @leemes: Shouldn't it be `unordered_map` for the second map? – Zeta Apr 03 '13 at 09:54
  • @leemes Thanks so much! I tried the `unordered_map`, it takes only 40 seconds to load the 60 million values. You save my life. – Joy Apr 03 '13 at 09:58
  • @leemes But, I still do not understand why GLib GHashTable cannot perform this task well. Is `unordered_map` not kind of hash table? How is its implementation detail? – Joy Apr 03 '13 at 09:59
  • @Zeta: Of course, thanks ;) @ Cai: This depends on your compiler / C++ standard library implementation, but it is most probably a hash map. (The standard only defines the interface, external behavior and some complexity constraints, i.e. expected insertion and look-up.) Did you try the reservation in advance of the expected final size? – leemes Apr 03 '13 at 13:10
  • 1
    @CaiShaojiang: Two reasons: First, `GHashTable` is a C library. C++ is a more powerful language. The C code has to work through pointers which isn't as fast as C++ templates. Secondly, `unordered_map` comes from [Boost](http://www.boost.org) which is a very competent group of software developers. – MSalters Apr 03 '13 at 13:10
  • @Zeta I did search for similar code to reserve space for `GHashTable`, but cannot find. – Joy Apr 03 '13 at 13:44
  • @leemes I would like to confirm with you about the query performance of `unordered_map`. Check this http://stackoverflow.com/questions/3902644/choosing-between-stdmap-and-stdunordered-map. It says the query time is `O(lgN)`, not constant. – Joy Apr 04 '13 at 01:43
  • @leemes You mentioned that the `unordered_map` comes from `Boost`. After searching, I found that there are `std::unordered_map` and `boost::unordered_map`. I think your answer means `boost::unordered_map`? – Joy Apr 04 '13 at 01:51
  • It wasn't me who said this. It *comes* from boost, but it made its way into the standard specification (and thus into most standard library implementatons) since C++11. It's basically the same thing, so if you want to stay compatible with the older standard version C++03, use boost's implementation, otherwise use the standard version. – leemes Apr 04 '13 at 06:45
0

I would go for two vectors (assuming that your values are really distinct), as this is O(1) in access where map is O(log n) in access

vector<uint64_t> values;
vector<size_t> keys

values.reserve(maxSize); // do memory reservation first, so reallocation doesn't occur during reading of data
keys.reserve(maxSize); // do memory reservation first, so reallocation doesn't occur during reading of data

Then, when reading in data

values[keyRead] = data;
keys[valueRead] = key;

Reading information is then the same

data = values[currentKey];
key = keys[currentData];
ogni42
  • 1,232
  • 7
  • 11