C++ std::map or std::set - efficiently insert duplicates

Question

I have a bunch of data full of duplicates and I want to eliminate the duplicates. You know, e.g. [1, 1, 3, 5, 5, 5, 7] becomes [1, 3, 5, 7].

It looks like I can use either std::map or std::set to handle this. However I'm not sure whether it's faster to (a) simply insert all the values into the container, or (b) check whether they already exist in the container and only insert if they don't - are inserts very efficient? Even if there's a better way... can you suggest a fast way to do this?

Another question - if the data I'm storing in them isn't as trivial as integers, and instead is a custom class, how does the std::map manage to properly store (hash?) the data for fast access via operator[]?

A `set` would be more suitable since you don't need an associated value with each element. I'm going to guess that checking and then inserting into the set will be slower than just inserting because you would have to essentially be doing two key lookups in the former. — GWW, Oct 10 '12 at 18:58
By definition either of those will check *for you* when perform the insert. I.e. they will do what you would otherwise with some other container: check for existance. Personally, I'd go with the set unless you're purposely mapping something to something else. — WhozCraig, Oct 10 '12 at 18:59
Is the data always sorted? Because it looks like you want [std::unique](http://msdn.microsoft.com/en-us/library/9f5eztca(v=vs.100).aspx), not a new container — Mooing Duck, Oct 10 '12 at 19:00
No it's not sorted. However I need a container in order to return results from the original data set (which I must keep intact). — Gigi, Oct 10 '12 at 19:11
Thanks everyone for your answers. Unfortunately I cannot mark them all. :) — Gigi, Oct 10 '12 at 19:16

score 11 · Accepted Answer · edited Oct 10 '12 at 19:03

std::map doesn't use hashing. std::unordered_map does, but that's C++11. std::map and std::set both use a comparator that you provide. The class templates have defaults for this comparator, which boils down to an operator< comparison, but you can provide your own.

If you don't need both a key and a value to be stored (looks like you don't) you should just use a std::set, as that's more appropriate.

The Standard doesn't say what data structures maps and sets use under the hood, only that certian actions have certain time complexities. In reality, most implementations I'm aware of use a tree.

It makes no difference time-complexity-wise if you use operator[] or insert, but I would use insert or operator[] before I did a search followed by an insert if the item isn't found. The later would imply two seperate searches to insert an item in to the set.

score 7 · Answer 2 · answered Oct 10 '12 at 19:00

7

An insert() on any of the associated containers does a find() to see if the object exists and then inserts the object. Simply inserting the elements into an std::set<T> should get rid of the duplicates reasonably efficiently.

Depending on the size of your set and the ratio of duplicates to unique values, it may be faster to put the objects into std::vector<T>, std::sort() then, and then use std::unique() together with std::vector<T>::erase() to get rid of the duplicates.

answered Oct 10 '12 at 19:00

Dietmar Kühl

150,225
13
225
380

*"`insert()` [...] does a `find()` [but if not found] inserts..."* - the code-style formatting of `find()` there might be taken by some readers as a call to the `find()` API call, whereas `insert(x)` implementations won't literally use `.find(x)` as when not present there's no record of (iterator to) where the search was abandoned, which is needed to skip another O(logN) tree traveral for the actual insertion. You could get close with `lower_bound` followed by the `insert` overload using an iterator `hint`, but `insert` implementations will handle this internally for optimal performance. – Tony Delroy Jul 30 '15 at 05:49

score 2 · Answer 3 · answered Oct 10 '12 at 19:00

How many times should you do it?

If insert is usual:

//*/
std::set<int> store;
/*/
// for hash:
std::unordered_set<int> store;
//*/
int number;

if ( store.insert(number).second )
{
  // was not in store
}

If you fill once:

std::vector<int> store;
int number;

store.push_back(number);
std::sort(store.begin(),store.end());
store.erase(std::unique(store.begin(),store.end()),store.end() );

// elements are unique

score 0 · Answer 4 · answered Oct 10 '12 at 18:59

Assuming the common implementation strategy for std::map and std::set, i.e. balanced binary search trees, both insertion and lookup have to do a tree traversal to find the spot where the key should be. So failed lookup followed by insertion would be roughly twice as slow as just inserting.

how does the std::map manage to properly store (hash?) the data for fast access via operator[]?

By means of a comparison function that you specify (or std::less, which works if you overload operator< on your custom type). In any case, std::map and std::set are not hash tables.

score 0 · Answer 5 · answered Oct 10 '12 at 19:01

std::set and std::map are both implemented as red black tree as far as I know. And probably using only insertion would be faster (then both because you would be doubling the lookup time).

Also map and set use operator <. As long as your class has defined operator < It would be able to use them as keys.

C++ std::map or std::set - efficiently insert duplicates

5 Answers5

Linked