65

I came across this good question, which is similar but not quite the same, since it is about Java, which has a different hash-table implementation by virtue of having synchronized accessors/mutators: What are the differences between a HashMap and a Hashtable in Java?

So what is the difference between the C++ implementations of set and unordered_set? The question can of course be extended to map vs unordered_map, and so on for the other C++ containers.

Here is my initial assessment:

set: While the standard doesn't explicitly require it to be implemented as a tree, the time-complexity constraints on its find/insert operations mean it will always be implemented as a tree. Usually an RB tree (as seen in GCC 4.8), which is height-balanced. Since they are height-balanced, they have predictable time complexity for find().

Pro: Compact (compared to the other data structure in this comparison)

Con: Access time complexity is O(lg n)
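
For illustration, here is a minimal sketch of the ordered, logarithmic behaviour described above (plain standard C++, nothing implementation-specific):

#include <iostream>
#include <set>

int main() {
    std::set<int> s;                      // typically backed by a red-black tree
    for (int v : {5, 1, 3, 1})            // the duplicate 1 is ignored
        s.insert(v);                      // O(lg n) per insertion

    if (s.find(3) != s.end())             // O(lg n) lookup
        std::cout << "found 3\n";

    for (int v : s)                       // iteration is always in sorted order
        std::cout << v << ' ';            // prints: 1 3 5
    std::cout << '\n';
}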

unordered_set: While the standard doesn't explicitly require it to be implemented as a hash-table, the time-complexity constraints on its find/insert operations mean it will always be implemented as one.

Pros:

  1. Faster (promises amortized O(1) for search)
  2. Easier to build thread-safe versions from basic primitives, as compared to a tree-based data structure

Cons:

  1. Lookup is not guaranteed to be O(1); the theoretical worst case is O(n).
  2. Not as compact as a tree (for practical purposes the load factor is never 1).

Note: The O(1) for a hash-table comes from the assumption that there are no collisions. Even with a load factor of 0.5, every second insertion leads to a collision. It could be observed that the load factor of a hash-table is inversely proportional to the number of operations required to access an element in it: the fewer operations we want, the sparser the hash-table must be. When the stored elements are comparable in size to a pointer, the overhead is quite significant.
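
As a reference point, the container exposes these quantities directly; a minimal sketch of how to inspect them is below (the exact bucket counts printed depend on the library implementation):

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> us;
    us.max_load_factor(0.5f);             // ask the table to stay at most half full

    for (int i = 0; i < 1000; ++i)
        us.insert(i * 7);                 // average O(1) insertions

    std::cout << "size:            " << us.size()            << '\n'
              << "bucket_count:    " << us.bucket_count()    << '\n'
              << "load_factor:     " << us.load_factor()     << '\n'
              << "max_load_factor: " << us.max_load_factor() << '\n';
}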

Did I miss any difference between map/set that one should know about for performance analysis?

Null
Ajeet Ganga
  • 2
    The elements of a `std::set` must be traversable in a specific order. This is the actual reason why the insertion, lookup, and removal operations are `O(lg n)`. – isekaijin Apr 18 '13 at 06:33
  • @EduardoLeón: I would think O(lg n) is a side effect of the tree being the underlying DS. That would also explain why items come out in a certain order when traversed. I am not sure, but I don't know that a 'specific order' is a requirement for 'set' in C++. I could be wrong. – Ajeet Ganga Apr 18 '13 at 06:38
  • 3
    Just for clarity, is the *question* here essentially "Did I miss anything important or get anything wrong?" – WhozCraig Apr 18 '13 at 06:43
  • @NicolBolas I swear by infinite monkeys that there exist 1000 numbers which would produce the same hash, even for the smartest hash function alive. – Ajeet Ganga Apr 18 '13 at 06:44
  • 2
    What is the actual question here? "What's the difference between a set and an unordered set" seems to have been answered by yourself. – Xeo Apr 18 '13 at 06:47
  • 1
    @Xeo I used to think I knew the obvious differences in Java between HashMap and Hashtable, until I went through all the answers and discovered that Hashtable is synchronized in Java, so, being synchronized by default, it is probably slower than HashMap for a small number of elements. I am just looking for "Did I miss anything I should have known?" – Ajeet Ganga Apr 18 '13 at 06:51
  • 1
    You should've left your knowledge out of the question blurb and posted it as an answer. If you are wrong, others may post a more correct answer, and if you are right, you just accept it by the end of the timer. – didierc Apr 18 '13 at 07:25
  • "Easy to convert basic primitives to thread-safe, as compared to tree-DS": what do you mean by this? – balki Apr 18 '13 at 08:32
  • @balki To provide fine-grained locking for a tree is difficult - effectively, to update with multiple writers, you need to lock a large portion of the tree. Hashmaps on the other hand are simpler - you only need to lock each bucket while you work on it - so multiple updates to a hashmap should generally be less contentious, hence faster (assuming you're not trying to update values in the same bucket from each thread, in which case it effectively becomes sequential - sort of like `O(1)` best case vs `O(n)` worst case for lookups, but instead for locking). – Yuushi Apr 18 '13 at 08:57
  • @Yuushi I see. However, as `std::unordered_set` doesn't provide any facility to lock a specific bucket, I think it doesn't matter for the current comparison between `std::set` and `std::unordered_set` – balki Apr 18 '13 at 11:57
  • @balki True, given the user will be forced to do their own locking for concurrent writes, it won't make any difference. – Yuushi Apr 18 '13 at 12:16
  • There's no need to use such rude language, even when talking about yourself. Please be polite. – didierc Apr 18 '13 at 16:19

4 Answers

30

I think you've generally answered your own question; however, this:

Not as compact as a tree (for practical purposes the load factor is never 1).

is not necessarily true. Each node of a tree (we'll assume it's a red-black tree) for a type T utilizes space that is equal to at least 2 * pointer_size + sizeof(T) + sizeof(bool). This may be 3 * pointer size depending on whether the tree contains a parent pointer for each tree node.

Compare this to a hash map: there will be wasted array space for each hash map due to the fact that the load factor is < 1, as you've said. However, assuming the hash map uses singly linked lists for chaining (and really, there's no real reason not to), each element inserted takes only sizeof(T) + pointer size.

Note that this analysis ignores any overhead which may come from extra space used by alignment.

For any element T which has a small size (so, any basic type), the size of the pointers and other overhead dominates. At a load factor of > 0.5 (for example) the std::unordered_set may indeed use up less memory than the equivalent std::set.
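
To put rough numbers on that argument, here is an illustrative back-of-the-envelope sketch. It simply encodes the per-node reasoning above (three pointers plus a colour flag for a tree node; one pointer plus a share of the bucket array for a chained hash node) and ignores alignment and allocator overhead, so treat the output as an estimate only:

#include <cstddef>
#include <iostream>

// Illustrative estimates only; real node layouts are implementation details.
template <typename T>
void estimate(double load_factor) {
    std::size_t ptr = sizeof(void*);

    // RB-tree node: left/right/parent pointers + colour flag + payload
    std::size_t tree_node = 3 * ptr + sizeof(bool) + sizeof(T);

    // Chained hash node: next pointer + payload, plus this element's share of
    // the bucket array (roughly one pointer per bucket, buckets ~ size / load_factor)
    double hash_node = ptr + sizeof(T) + ptr / load_factor;

    std::cout << "sizeof(T)=" << sizeof(T)
              << "  tree ~ " << tree_node
              << "  hash ~ " << hash_node << " bytes per element\n";
}

int main() {
    estimate<int>(1.0);                 // small payload: overhead dominates
    estimate<int>(0.5);                 // sparser table costs a bit more
    estimate<long long[4]>(1.0);        // larger payload: overhead matters less
}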

The other big missing point is the fact that iterating through a std::set is guaranteed to produce an ordering from smallest to largest, based on the given comparison function, while iterating through an std::unordered_set will return the values in a "random" order.
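
A quick way to see that last difference (the second line's order depends on the library's hash and bucket layout, so it may vary between implementations):

#include <iostream>
#include <set>
#include <unordered_set>

int main() {
    std::set<int> ordered{4, 1, 9, 2};
    std::unordered_set<int> unordered{4, 1, 9, 2};

    for (int v : ordered)   std::cout << v << ' ';   // always: 1 2 4 9
    std::cout << '\n';
    for (int v : unordered) std::cout << v << ' ';   // some implementation-defined order
    std::cout << '\n';
}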

Yuushi
  • @PeteBecker For (amortized) `O(1)` lookup, it's effectively forced to be an array of lists (or simply an array if cuckoo hashing is used) - the random access requirement to get to the specified bucket in `O(1)` will enforce this. If you were referring to what each bucket utilizes, read my post again, I explicitly specify I assume a singly linked list (although this is certainly not enforced by the standard). – Yuushi Apr 18 '13 at 12:14
  • I thought I had deleted that message, since it's a distraction. I've deleted it now. – Pete Becker Apr 18 '13 at 12:26
11

Another difference (though not performance-related) is that set insertion doesn't invalidate iterators, while unordered_set insertion can if it triggers a rehash. In practice it's a pretty minor concern, since references to the actual elements remain valid.
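
A small sketch of what that means in practice; whether and when the rehash fires depends on the implementation's growth policy, so this only illustrates the rule rather than proving it:

#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> us;
    us.insert(42);
    const int* p = &*us.find(42);          // pointer/reference to the stored element

    std::size_t buckets = us.bucket_count();
    for (int i = 0; i < 10000; ++i)
        us.insert(i);                      // may trigger one or more rehashes

    if (us.bucket_count() != buckets)      // bucket count changed => a rehash happened
        std::cout << "iterators from before the rehash are invalidated, "
                     "but *p is still valid: " << *p << '\n';

    // reserve() up front avoids rehashing (and iterator invalidation) while
    // inserting a batch whose size is known in advance:
    std::unordered_set<int> us2;
    us2.reserve(10000);
}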

dhaffey
  • How can that be, since if the `set` is implemented as an rb-tree, an insertion can trigger a tree rebalancing? – Björn Lindqvist Nov 04 '16 at 14:29
  • 3
    Because the iterators can be (and AFAIK, always are) implemented in terms of a pointer to an internal tree node. A rebalancing operation doesn't need to create or destroy nodes, it can just shuffle some left/right/parent pointers around. So afterward, a previously-valid iterator is left pointing at a valid node and can still get at everything it needs to traverse the tree. – dhaffey Nov 05 '16 at 00:57
2

Yuushi addresses spatial efficiency and other points well already; just a few other parts of the question I'll comment on...

The O(1) for a hash-table comes from the assumption that there are no collisions.

That's not true. What O(1) means is not that the first lookup attempt will always succeed; it's that there is - on average - a constant number of attempts needed, rather than something that grows as the number of values grows. For example, with an unordered_set or ..._map, the max_load_factor defaults to 1.0 on construction, and if the load factor approaches that with a good hash function, the average number of elements that hash to any one bucket will be around 2, regardless of how many values are in the table.

Even with a load factor of 0.5, every second insertion leads to a collision.

True, but it doesn't get as dire as you might intuitively expect: that average chain length of 2 at a 1.0 load factor is not bad.

It could be observed that the load factor of a hash-table is inversely proportional to the number of operations required to access an element in it: the fewer operations we want, the sparser the hash-table must be.

There's definitely a correlation (it's not inverse).
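
To put a number on the average chain length mentioned above, you can measure it directly; a small sketch follows (the exact figures depend on the library's hash and bucket growth policy):

#include <iostream>
#include <random>
#include <unordered_set>

int main() {
    std::unordered_set<int> us;            // default max_load_factor() == 1.0
    std::mt19937 gen(12345);
    std::uniform_int_distribution<int> dist(0, 1 << 30);

    while (us.size() < 100000)
        us.insert(dist(gen));

    // Count how many buckets are actually occupied, then compute the average
    // chain length a successful lookup has to walk.
    std::size_t used = 0;
    for (std::size_t b = 0; b < us.bucket_count(); ++b)
        if (us.bucket_size(b) > 0) ++used;

    std::cout << "load factor: " << us.load_factor()
              << "  average chain in non-empty buckets: "
              << double(us.size()) / used << '\n';
}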

Tony Delroy
  • Can you achieve a load factor of 0.5 without any collisions at all? Or at least not a collision on every second insertion? – Yola Feb 13 '18 at 08:26
  • @Yola: with a general purpose hash function that's unlikely to do poorly no matter the input, no, as the placement of each new item is effectively random but reproducible: for them, half the buckets already being in use means a 50/50 chance of a collision. In practice, many languages/libraries ship with hash implementations that simply pass integers through unchanged, so if the keys tend to increment, they'll map nicely into successive buckets and you often see fewer collisions (and more cache-friendliness) than you would with that general purpose random-but-repeatable hash. – Tony Delroy Feb 13 '18 at 08:30
  • At the extreme, programs like [`gperf`](https://www.gnu.org/software/gperf/) can normally generate source code to do perfect hashing (i.e. 0 collisions) for a set of constant keys, but that's no use for inputs unknown until run-time. – Tony Delroy Feb 13 '18 at 08:34
  • Ah, I see: if you already have a 0.5 load factor, then you have a collision on every second insertion. Well, not on every second, less often than that, but still, this wording is OK too. Thanks. – Yola Feb 13 '18 at 08:59
1

In some cases set is more convenient.

For example, using a vector as the key:

set<vector<int>> s;
s.insert({1, 2});
s.insert({1, 3});
s.insert({1, 2});

for(const auto& vec:s)
    cout<<vec<<endl;   // I have overloaded << for vector
// 1 2
// 1 3 

The reason vector<int> can be stored in a set is that vector overloads operator<.

But if you use unordered_set<vector<int>> you have to provide a hash function for vector<int>, because vector doesn't have one, so you have to define something like:

struct VectorHash {
    size_t operator()(const std::vector<int>& v) const {
        std::hash<int> hasher;
        size_t seed = 0;
        for (int i : v) {
            seed ^= hasher(i) + 0x9e3779b9 + (seed<<6) + (seed>>2);
        }
        return seed;
    }
};

void two(){
    //unordered_set<vector<int>> s; // error: no hash function for vector<int>
    unordered_set<vector<int>, VectorHash> s;
    s.insert({1, 2});
    s.insert({1, 3});
    s.insert({1, 2});

    for(const auto& vec:s)
        cout<<vec<<endl;
    // 1 2
    // 1 3
}

You can see that in some cases unordered_set is more complicated to use.

Mainly cited from: https://stackoverflow.com/a/29855973/6329006

For more differences between unordered_set and set, see this: https://stackoverflow.com/a/52203931/6329006

Jayhello