
I have a set of pointers. In the first step, I insert data pointers, and in the second step, I iterate over the whole set and do something with the elements. The order is not important, I just need to avoid duplicates, which works fine with pointer comparison.

My question is whether it might be advantageous to use an unordered set for the same purpose. Is insertion faster for an unordered set?
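
For illustration, a minimal sketch of what I mean (the element type `Item` is just a placeholder):

```cpp
#include <iostream>
#include <set>
#include <vector>

struct Item { int value; };

int main() {
    std::vector<Item> storage = {{1}, {2}, {3}};

    // Step 1: insert pointers; duplicates are filtered out by pointer comparison.
    std::set<Item*> seen;
    seen.insert(&storage[0]);
    seen.insert(&storage[1]);
    seen.insert(&storage[0]);  // duplicate, silently ignored

    // Step 2: iterate over the whole set and do something with each element.
    for (Item* p : seen)
        std::cout << p->value << '\n';
}
```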

Fabian
    "The order is not important" - once you decided on that, use `unordered_set`. The only advangtage of ordered containers is.. order. – Ami Tavory Feb 19 '16 at 12:11
  • How many elements are we talking about? And do you do computation-intensive work on each item, or is it more like summing up / multiplying all elements? – MikeMB Feb 19 '16 at 12:40
  • 3
  • Ordered containers have another important advantage: they can guarantee that each operation is O(lg n), while unordered ones require O(n) in the worst case. So if you want to make promises about complexity, use std::set. – James Feb 19 '16 at 12:43
  • Related: [what is the difference between set and unordered_set in C++?](http://stackoverflow.com/questions/16075890/what-is-the-difference-between-set-and-unordered-set-in-c) – Cody Gray - on strike Feb 19 '16 at 12:52
  • @James: For which operations does this apply, for example? In my use case, I restrict myself to clear(), insert() and iteration. – Fabian Feb 19 '16 at 12:56
  • @Fabian: `insert()` could take O(n) if all elements end up in the same bucket, but with pointers that would be really bad luck – MikeMB Feb 19 '16 at 13:06
  • @Fabian If there are a lot of collisions, the hash table behaves like a linked list: insert() walks that list until it finds the element or reaches the end (O(n)). For clear() and iteration (assuming this means visiting all elements), both are O(n). The operations in my previous comment mainly refer to insert, find and delete (a contrived collision sketch follows below). – James Feb 19 '16 at 13:32
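
To make the worst case from the last two comments concrete, here is a contrived sketch; the constant hash is deliberately artificial, and the default `std::hash` for pointers does not behave like this:

```cpp
#include <cstddef>
#include <unordered_set>

// Deliberately degenerate hash: every pointer lands in the same bucket.
struct ConstantHash {
    std::size_t operator()(const int*) const { return 0; }
};

int main() {
    int data[1000] = {};

    // All elements collide, so each insert scans one long bucket: O(n) per insert.
    std::unordered_set<const int*, ConstantHash> worst_case;
    for (const int& x : data)
        worst_case.insert(&x);

    // The default std::hash for pointers spreads elements across buckets,
    // giving the usual O(1) average insert.
    std::unordered_set<const int*> typical;
    for (const int& x : data)
        typical.insert(&x);
}
```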

1 Answer


As Ami Tavory commented, if you don't need order, it's usually best to go for unordered containers. The reasoning: if maintaining order somehow improved performance, an unordered container would still be free to do so internally, and could therefore achieve the same or better complexity anyway.

A downside of unordered collections is that they usually require a hash function for the key type. If it's too hard or expensive to make one, then containers which don't use hashes might be better.

In the C++ standard library, insertion into std::set is O(log(N)), whereas for std::unordered_set it's O(1) on average. Aside from that, there are probably fewer cache misses on average when using std::unordered_set.

At the end of the day though, this is just theory. You should try something that sounds good enough and profile it to see if it really is.
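
As a rough sketch of what "try it and profile" could look like for pointer keys (the type, size and workload below are made up), switching between the two containers is a one-line change:

```cpp
#include <chrono>
#include <iostream>
#include <set>
#include <unordered_set>
#include <vector>

struct Node { int payload; };

// Swap this alias to std::set<Node*> to compare against the ordered container.
using PointerSet = std::unordered_set<Node*>;

int main() {
    std::vector<Node> nodes(1000000);

    auto start = std::chrono::steady_clock::now();

    PointerSet seen;
    for (Node& n : nodes)
        seen.insert(&n);        // step 1: deduplicate by pointer identity

    long long sum = 0;
    for (Node* p : seen)
        sum += p->payload;      // step 2: iterate and use the elements

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::cout << seen.size() << " unique elements, checksum " << sum
              << ", " << ms << " ms\n";
}
```

Compile with optimizations and run it against your real data and workload; only the relative numbers matter.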

Yam Marcovic
  • For pointers, which is what the question is about, there is a specialisation of `std::hash` in the standard library, so nothing to worry about. – Jonas Schäfer Feb 19 '16 at 12:43
  • Another thing to consider is that the behaviour of comparing unrelated pointers (ones not pointing into the same array or the same object) is undefined. So, strictly speaking, it is unsafe to store pointers in `std::set` – Revolver_Ocelot Feb 19 '16 at 12:54
  • Since you mention complexity: worst case lookup isn't relevant for the original question, but O(N) for unordered_set can be a reason to choose an ordered set instead. Another argument *for* defaulting to unordered is expression of intent: the reader sees that you don't care about order. – peterchen Feb 19 '16 at 13:00
  • 1
  • @Revolver_Ocelot Actually, I believe that in C++ the result is only unspecified. Also std::set uses std::less, which guarantees a total order even for pointers (see the sketch after these comments). – MikeMB Feb 19 '16 at 13:03
  • @MikeMB, it is undefined, as in _omits any explicit definition of behavior_ (__1.3.24__), but thanks for pointing to the total-order requirement for pointers imposed by `std::less` – Revolver_Ocelot Feb 19 '16 at 13:08
  • 1
  • Probably relevant: Chandler Carruth (works on clang and compiler optimizations) said that these days with cache behaviors, `std::unordered_map` should be the preferred default over `std::map`. He pointed out that each time he saw `std::map` in the code, it was a red flag for him in terms of performance. – Yam Marcovic Feb 19 '16 at 13:17
  • @Revolver_Ocelot: Are you sure? **§ 5.9** explicitly says the result is unspecified, which I thought means it yields a valid value (in this case either true or false); the standard just doesn't specify which one. The situation is different in C, by the way. – MikeMB Feb 19 '16 at 13:31
  • @MikeMB It says _"Otherwise, if a pointer p compares greater than a pointer q, ..."_ The preceding paragraph defines comparison only for a limited range of pointers. The last point probably relates to equivalent but not equal pointers. It might depend on reading, though. Anyway, _unspecified_ means the result can differ between two comparisons of the same two pointers. So it is still not safe to use < for arbitrary pointers. – Revolver_Ocelot Feb 19 '16 at 13:45
  • If memory consumption is an issue, std::map may be the better way to go, see [gcc-hash-map-vs-unordered-map](http://fgda.pl/post/7/gcc-hash-map-vs-unordered-map). 220 kB vs 10 MB, but it's a bit dated, so I'm not sure whether the values still hold. – Thomas Feb 19 '16 at 13:49
  • 2
  • @Revolver_Ocelot: No argument there. All I said was that it is unspecified and not undefined (of which I'm still convinced) and - more importantly - it is safe to use pointers in `std::set`. I think we agree on the latter, which makes the former somewhat OT, but if you want, we can discuss this in chat. – MikeMB Feb 19 '16 at 17:27
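
A small sketch of the `std::less` point from the comments above (variable names are arbitrary): `std::set`'s default comparator is required to impose a strict total order even on unrelated pointers, so the container itself stays on safe ground:

```cpp
#include <functional>
#include <set>

int main() {
    int a = 1;
    int b = 2;  // a and b are unrelated objects, so a raw `&a < &b` is not portable

    // std::set's default comparator is std::less<int*>, which must impose a
    // strict total order even on unrelated pointers, so this is fine:
    std::set<int*> pointers;
    pointers.insert(&a);
    pointers.insert(&b);

    // Comparing through std::less directly is likewise well-defined:
    bool a_before_b = std::less<int*>()(&a, &b);
    (void)a_before_b;
}
```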