2

People have asked similar questions about the efficiency of various data structures but none I have read are totally applicable to my scenario so I wondered if people had suggestions for one that was tailored to satisfy the following criteria efficiently:

  • Each element will have a unique key. There will be no possibility of collisions because each element hashes to a different key. EDIT: *The key is a 32-bit uint.*
  • The elements are all unique and therefore can be thought of as a set.
  • The only operations required are adding and getting, not deletion. These need to be quick as they will be used several 100,000 times in a typical run!
  • The order in which elements are kept is irrelevant.
  • Speed is more important than memory-consumption... though it can't be too greedy!

I am developing for a company that will use the program commercially so any third-party data structures should come with no copyright protection or anything, but if the STL has a data structure that will do the job efficiently then that would be perfect.

I know there are countless Hashmap/Dictionary style C++ data structures with implementations that are built to satisfy different criteria so if someone can suggest one ideal for this situation then that would be greatly appreciated.

Many thanks

Edit:

I found this passage on SO that seems to suggest unordered_map would be good?

hash_map and unordered_map are generally implemented with hash tables. Thus the order is not maintained. unordered_map insert/delete/query will be O(1) (constant time) where map will be O(log n) where n is the number of items in the data structure. So unordered_map is faster, and if you don't care about the order of the items should be preferred over map. Sometimes you want to maintain order (ordered by the key) and for that map would be the choice.

Darius
  • 5,180
  • 5
  • 47
  • 62
  • and what's the problem with using a normal std::map ? – Ioan Paul Pirau Jul 27 '11 at 08:53
  • I think you're overthinking this. I'd just go for hash map. @John std::map uses a red-black tree, and given the description above I think a hash table would be a better fit. How many elements will you be storing (approx)? Will you be adding and getting repeatedly, or will you add your elements and then get them? – salezica Jul 27 '11 at 08:54
  • When you say taht the order is irrelevant: do you always access in sequence or by the key? Is there anything you know about the key and how the key is structured? – Mario The Spoon Jul 27 '11 at 08:54
  • @John Paul I was wondering which is most efficient and suitable. Are you suggesting that map is? – Darius Jul 27 '11 at 08:54
  • @Mario The Spoon the key is a uint and I retrieve values using keys, and then once at the end of execution I iterate through all entries in the table. – Darius Jul 27 '11 at 08:56
  • @Santiago Lezica I will be adding and getting repeatedly during execution and the number of elements will probably be in the 1,000's if not 10's of 1,000's. – Darius Jul 27 '11 at 08:57
  • @Santiago, you are right the hashmap seems to be a better fit than the map, as it is faster. Thank you; Greenhouse.. I was just thinking that a map will fit well for your description, but as Santiago suggested, the hashmap should be faster : http://stackoverflow.com/questions/2189189/map-vs-hash-map-in-c – Ioan Paul Pirau Jul 27 '11 at 09:01

5 Answers5

2

Looks like a prefix tree (with element at each node end) also fits in this scenario. It's damn fast, even faster than hash map because no hash value calculation is done and getting a value is purely O(n) where n is the key length. It's a bit memory hungry but common prefix of keys are shared in the same node path.

EDIT: I assume the keys are string, not simple values like integers

LeleDumbo
  • 9,192
  • 4
  • 24
  • 38
  • O(n)? Are you sure? hash map is O(1), binary tree O(logN), so O(n) doesn't seem fast in comparison... – juanchopanza Jul 27 '11 at 09:18
  • @juanchopanza but your order of magnitude comment is to do with size of collection as the basis, whereas n here is key length – Darius Jul 27 '11 at 09:27
2

As for build-in solutions I'd recommand google::dense_hash_map. They are really fast especially for numeric keys. You'll have to decide on a specific key that will be reserved as "empty_key". Moreover here is a really nice comparison of different hash-map implementations.

An excerpt

Library         Linux-intCPU (sec)  Linux-strCPU (sec)   Linux PeakMem (MB)
glib            3.490               4.720                24.968
ghthash         3.260               3.460                61.232
CC’s hashtable  3.040               4.050                129.020
TR1             1.750               3.300                28.648
STL hash_set    2.070               3.430                25.764
google-sparse   2.560               6.930                5.42/8.54
google-dense    0.550               2.820                24.7/49.3
khash (C++)     1.100               2.900                6.88/13.1
khash (C)       1.140               2.940                6.91/13.1
STL set (RB)    7.840               18.620               29.388
kbtree (C)      4.260               17.620               4.86/9.59
NP’s splaytree  11.180              27.610               19.024

However, when setting a "deleted_key", this map can also perform deletions. So maybe it'll be possible to create a custom solution that is even more efficient. But apart from that minor point, any hash-map should exactly suit your needs (note that "map" is an ordered tree-map and thus slower).

b.buchhold
  • 3,837
  • 2
  • 24
  • 33
  • That was a very interesting read thank you. I am going to look in to google_dense_hash because for int keys it seems to outperform all the others in terms of the author's benchmarks – Darius Jul 27 '11 at 09:35
1

What you need definitely sounds like a hash set, C++ has this as either std::tr1::unordered_set or in Boost.Unordered.

P.S. Note, however, that TR1 is not yet standard, and you'll probably need to get Boost for the implementation.

TC1
  • 1
  • 3
  • 20
  • 31
  • If you are using a reasonably new gcc, then you have access to the unordered maps and sets with `g++ -std=c++0x`. – juanchopanza Jul 27 '11 at 09:19
  • @juanchopanza Didn't know that, haven't checked the gcc man for a while... Thanks. :) – TC1 Jul 27 '11 at 09:23
0

It sounds like std::unordered_set would fit the bill, but without knowing more about the key, it's difficult to say. I'm curious about how you can guarantee that there will be no possibility of collisions: this implies a small (less than the size of the table), finite set of keys. If this is the case, it may be more efficient to map the keys to a small int, and use std::vector (with empty slots for the entries not present).

James Kanze
  • 150,581
  • 18
  • 184
  • 329
  • Many thanks for your input. I have updated the question about the key, it is a uint. They are unique because I will look up a hashed value and if it already exists in the table I just use the stored value and not the new one. There is no linear probing or chaining involved because it is unnecessary. – Darius Jul 27 '11 at 09:02
  • @Greenhouse Gases If you can create an `std::vector` large enough for all of the possible entries, just using the `uint` directly will be the fastest solution. If the range of the `uint` isn't restricted, however, you'll have to map `[0...UINT_MAX]` into the size of your table, which means handling collisions. Any improvement on `std::unordered_set` would have to be based on knowledge of the distribution of the keys in practice. – James Kanze Jul 27 '11 at 09:21
0

What you're looking for is an unordered_set. You can find one in Boost, TR1, or C++0x. If you're hoping to associate the key with a value, then unordered_map does just that- also in Boost/TR1/C++0x.

Puppy
  • 144,682
  • 38
  • 256
  • 465