2

I need to be able to store and lookup generic strings. I don't know much about the content of the strings, a little more then 2/3 are human language words with the rest being something closer to a UUID or number/letter combo. I know that any particular grouping will be constant (ie if it has some human words it will be all human words, if it has some UUID all the contents will be UUIDs etc).

I need to decide if I should place this data in a map or a hashmap to get the best average lookup rate. I'm inclined to say map with the O(log n) runtime because I don't believe I can make a proper efficient hash for strings when I know so little about their input format. Any thoughts as to which would be better?

EDIT: I forgot one key aspect. I don't know the length of the strings and so am concerned memory usage may grow too lage for long strings. If I used the hash method I would do something where after X characters the hash doesn't hash on a per-character basis to avoid the memory consumption being too huge.

What I would really like is a hash map implementation that keeps multuple values in the 'bucket' sorted in an ordered manaer so it can offer a (log N) search of the buckets; but I don't think that exists in stardrd C++ and it's not worth writeing from scratch.

pps. the data is near-static. which I'll occasionally have to add to the list it's rare and I'm willing to accept a slow write time. I only care about the lookup time.

dsollen
  • 6,046
  • 6
  • 43
  • 84

4 Answers4

5

It is difficult to make a single recommendation. It depends on several tradeoffs (type of iteration, memory vs lookup). Throughout I assume that you can use a C++11 compiler (or the equivalent Boost or TR1 libraries).

If the insertion/lookup times are the most important to you, I would definitely use std::unordered_set (see reference) with std::hash<std::string> (see reference). Both insertion and lookup are O(1) on average (amortized constant). If

Note that the unordered hash containers do not allow you to do iteration in sorted order. So if you want sorted iteration, then you can use the ordered container std::set<std::string>, but the price you pay is O(log N) lookup/insertion.

Memory constraints are more difficult to analyze. First, the ordered containers std::set and std::map need roughly 3 words per element overhead to maintain a tree structure that allows the ordered iteration. The unordered hash containers, however, have some spare capacity as hash containers operate very poorly on a full load factor.

#include <iostream>
#include <functional>
#include <string>
#include <unordered_set> // or <set> for ordered lookup

int main()
{
    // or std::set<std::string> for ordered lookup
    std::unordered_set<std::string> dictionary; 

    std::string str = "Meet the new boss...";
    dictionary.insert(str);
    auto it = dictionary.find(str);

    std::cout << *it << '\n';
}

Output on Ideone. If you also want to store Value alongside the std::string, then you can use a std::unordered_map<std::string, Value>, or std::map<std::string, Value> with the same hash function.

Conclusion: it is best to measure what works best for your application, depending on the tradeoffs indicated above.

TemplateRex
  • 69,038
  • 19
  • 164
  • 304
  • 1
    You can also use unordered_set and unordered_map if you have an implementation of TR1. I'm pretty sure they're also in Boost. – Pete Becker Aug 14 '12 at 18:25
  • 1
    I feel as if your follow up answer ignores the more important memory cost for an unordered set. Yes there is more overhead for individual elements within a map, but the difference is that the tree only contains valid data. With a hash map solution there will be buckets that don't comtain any data at all that still consume memory despite being empty. In a situation where your hash function give spares/spread out results you have to choose between two options. A small hashmap that ends up with large buckets and risks being close to the worst case N time, or lose a bit of mem to empty buckets – dsollen Aug 15 '12 at 14:26
  • @dsollen That's a very good point. I don't know the exact memory expansion strategy for `std::unordered_map`. I think it might be the same as for `std::vector` (capacity about 1.5 times current size). OTOH, if you have very long strings, then `std::map` still has to store those as well, instead of a 32- or 64-bit hash key. – TemplateRex Aug 15 '12 at 14:46
  • I think the string has to be stored in both ordered and unordered. Imagine I have a Hash A leading to Bucket A. In the bucket I have two values (a and b) stored. If I want to look up a value I can use a string to get a hash and find bucket A using the hash. But I still need to determine if value a or b (or neither) is the value that should be returned. The only way of doing that is looking at the full non-hashed string key for value a and b to see if they match my fetch key. So without knowing exact implimation I would assume the hashmap still must store the string in addition to hash. – dsollen Aug 15 '12 at 18:27
  • @dsollen You are right, and I already changed that in my previous edit. It's only 3 word per element vs spare capacity that plays a role in the memory tradeoff – TemplateRex Aug 15 '12 at 18:34
3

Apart from std::set, std::map, std::unordered_set and std::unordered_map - I would also consider studying Tries to see if they would be a better fit:

http://en.wikipedia.org/wiki/Trie

Andrew Tomazos
  • 66,139
  • 40
  • 186
  • 319
  • tries do look interesting and potentially worth considering, but i'm not willing to expend too much effort in potentially pre-mature optimization by writing a data structure from scratch. Does C++ (or boost) already contain a tries implementation? – dsollen Aug 14 '12 at 18:20
  • @duedl0r: I am not sure that is correct. How did you come to that conclusion? Do you have a reference? – Andrew Tomazos Aug 14 '12 at 18:33
  • just an uneducated guess.. somehow you have to store the pointers, so every byte of the data will be 5 times bigger (4 byte pointer). Of course there are methods to compress that, but I wouldn't dare to implement those – duedl0r Aug 14 '12 at 20:51
  • @duedl0r: You haven't thought this through. There is one pointer per string just like in a binary tree or hash table, except that it points at the strings unique suffix rather than the entire string. Further as common prefixes are only stored once, overall it actually takes less space than a binary tree or hash table. – Andrew Tomazos Aug 14 '12 at 21:07
1

Cedar, HAT-Trie, and JudyArray is quite awesome, you can find the benchmark here.

benchmark result

Kokizzu
  • 24,974
  • 37
  • 137
  • 233
0

You might want to have a look at the benchmark: http://www.dotnetperls.com/sorteddictionary It appears in real application inspite of collisions Dictionary is better than SortedDictionary.

Roman Dzhabarov
  • 521
  • 4
  • 10