
Profiling my CPU-bound code has suggested that I spend a long time checking whether a container contains completely unique elements. Assuming that I have some large container of unsorted elements (with `<` and `==` defined), I have two ideas on how this might be done:

The first using a set:

template <class T>
bool is_unique(vector<T> X) {
  set<T> Y(X.begin(), X.end()); // a set keeps only distinct elements
  return X.size() == Y.size();  // sizes match only if X had no duplicates
}

The second looping over the elements:

template <class T>
bool is_unique2(vector<T> X) {
  typename vector<T>::iterator i,j;
  for(i=X.begin();i!=X.end();++i) {
    for(j=i+1;j!=X.end();++j) {
      if(*i == *j) return false; // stop at the first duplicate
    }
  }
  return true;
}

I've tested them as best I can, and from what I can gather from reading the documentation about the STL, the answer is (as usual): it depends. I think that in the first case, if all the elements are unique it is very quick, but if there is a large degeneracy the operation seems to take O(N^2) time. For the nested-iterator approach the opposite seems to be true: it is lightning fast if X[0]==X[1], but takes (understandably) O(N^2) time if all the elements are unique.

Is there a better way to do this, perhaps an STL algorithm built for this very purpose? If not, are there any suggestions to eke out a bit more efficiency?

Jonathan Leffler
Hooked
  • Should the container be allowed to contain duplicates at all? Perhaps you need a set, not a vector? – May 04 '10 at 21:48
  • Your implementation of `is_unique` would be faster if it took a `const vector<T>&` as its argument instead of accepting its argument by value. That way, you avoid making a copy of the vector and then also copying that copy into a set. – Tyler McHenry May 04 '10 at 21:49
  • @Neil, the container needs random access (hence the vector) for other parts of the code. – Hooked May 04 '10 at 21:55
  • @Hooked If possible, you may want to consider keeping the vector sorted all the time as you construct it. That would make `is_unique` (whether implemented by `std::set` or `std::unique`) run in linear time. By keeping the vector sorted, you spread out the work over time, and have to "pay" for some of the work only once per element, rather than taking a big hit by having to re-compute everything each time you call `is_unique`. – Tyler McHenry May 04 '10 at 21:57
  • Can you define a hash function on your element type? If so, using a hash table instead of a binary-tree based `set` should work very well, particularly if you test for membership on every insert instead of adding everything first and then checking. Note that standard STL doesn't have any hash-based containers, although `hash_set` is present as an extension; or you could use Boost's. – tzaman May 04 '10 at 23:59
  • In addition to tzaman's comment, I think you should also consider the average size of your dataset. Are you spending time checking for collisions in thousands of vectors of dozens of entries each, or dozens of vectors of thousands of entries each? Read any paper on search or sort algorithms written in the last 15 years, and it's hard not to find one that talks about how this or that algorithm outpaces the current champion by such and such percent in a particular list size range (say, 10k to 50k entries). – Jason May 05 '10 at 05:05

11 Answers


Your first example should be O(N log N), since set insertion takes O(log N) per element. I don't think a faster big-O is possible using only `operator<` comparisons.

The second example is obviously O(N^2). The coefficient and memory usage are low, so it might be faster (or even the fastest) in some cases.

It depends what T is, but for generic performance, I'd recommend sorting a vector of pointers to the objects.

template< class T >
bool dereference_less( T const *l, T const *r )
 { return *l < *r; } 

template <class T>
bool is_unique(vector<T> const &x) {
    vector< T const * > vp;
    vp.reserve( x.size() );
    for ( size_t i = 0; i < x.size(); ++ i ) vp.push_back( &x[i] );
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
    return adjacent_find( vp.begin(), vp.end(),
           not2( ptr_fun( &dereference_less<T> ) ) ) // "opposite functor": !(*l < *r)
        == vp.end(); // after sorting, !(*vp_n < *vp_n+1) means the pair is equal, i.e. a duplicate
}

or in STL style,

template <class I>
bool is_unique(I first, I last) {
    typedef typename iterator_traits<I>::value_type T;
    …
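
One plausible completion of that sketch (my filling-in, not Potatoswatter's tested code; per his comment below, you just substitute in those three lines, reusing the `dereference_less` helper from above):

template <class I>
bool is_unique(I first, I last) {
    typedef typename iterator_traits<I>::value_type T;
    vector< T const * > vp;
    for ( I it = first; it != last; ++it ) vp.push_back( &*it ); // collect pointers
    sort( vp.begin(), vp.end(), ptr_fun( &dereference_less<T> ) ); // O(N log N)
    return adjacent_find( vp.begin(), vp.end(),
           not2( ptr_fun( &dereference_less<T> ) ) ) == vp.end();
}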

And if you can reorder the original vector, of course,

template <class T>
bool is_unique(vector<T> &x) {
    sort( x.begin(), x.end() ); // O(N log N)
    return adjacent_find( x.begin(), x.end() ) == x.end();
}
Potatoswatter
  • Putting things in as pointers is not a bad idea. This could be applied to either of the set based algorithms. – clahey May 04 '10 at 21:57
  • `std::adjacent_find` is a much better idea than `std::unique`. – James McNellis May 04 '10 at 22:21
  • This is, by far, the consistently fastest (or nearly so) example posted. Would it be too much to ask you to elaborate on the "STL-style" so that I (and others) could see how you would do it? Also, running your first example gives the error "invalid conversion from ‘const int*’ to ‘int*’" on the &x; removing the const seemed to do the trick. – Hooked May 04 '10 at 22:24
  • Am I missing something? This edit gives the error `no match for call to (std::binary_negate) (int*&, int*&)` and does not compile. – Hooked May 04 '10 at 23:00
  • @Hooked: sorry, I really should've tested it. There was a const correctness error (missing a few consts rather than having one extra) and an issue with the functor being incompatible with `not2`… here I switched to the `ptr_fun` functor generator. – Potatoswatter May 04 '10 at 23:18
  • The "STL-style" prototype is more like the way the functions in `` are written. It takes a pair of iterators rather than a container. Just substitute in those three lines and you should be good to go. (But I didn't test it ;v) .) – Potatoswatter May 04 '10 at 23:21
  • Faster big-O can be achieved by using a hash-table based `set` implementation, at the cost of some space overhead. O(1) inserts for n elements == O(n). – tzaman May 04 '10 at 23:25
  • @tzaman: correct, I still forget about hashtables! However the question specifies `operator<` rather than a hash function. – Potatoswatter May 04 '10 at 23:37

You must sort the vector if you want to quickly determine whether it has only unique elements. Otherwise, the best you can do is O(n^2) runtime, or O(n log n) runtime with O(n) space. I think it's best to write a function that assumes the input is sorted:

template<class Fwd>
bool is_unique(Fwd first, Fwd last)
{
    // precondition: [first, last) is already sorted
    return adjacent_find(first, last) == last;
}

then have the client sort the vector, or make a sorted copy of the vector. This opens a door for dynamic programming: if the client sorted the vector in the past, they have the option to keep and refer to that sorted vector, so they can repeat this operation in O(n) time.

Brian Neal
wilhelmtell

The standard library has std::unique, but since it (and the sort it requires) modifies the sequence, you would have to make a copy of the entire container (note that in both of your examples you make a copy of the entire vector as well, since you unnecessarily pass the vector by value):

template <typename T>
bool is_unique(std::vector<T> vec) // taken by value: sorting mutates our copy
{
    std::sort(vec.begin(), vec.end());
    return std::unique(vec.begin(), vec.end()) == vec.end(); // nothing removed == all unique
}

Whether this would be faster than using a std::set would, as you know, depend :-).

James McNellis
  • Asymptotically speaking, this has the same space and time requirements as the `is_unique` in the question. They are both O(n) in space and O(n log n) in time. That is to say, their run times are dominated by sorting (explicit sorting in your example, and the sorting internal to `std::set` in the OP's). My suggestion would be to try both and pick whichever happens to be faster in practice. – Tyler McHenry May 04 '10 at 21:53
  • I must be getting old. Why couldn't I see that? You didn't slip in an edit just under the 5 min. mark, did you? – Fred Larson May 04 '10 at 21:55
  • @Tyler actually, no, worse, this is O(n^2) runtime and O(n) space. The _worst case_ of `std::sort()` is O(n^2). – wilhelmtell May 04 '10 at 22:04
  • 1
    @WilhelmTell: _Technically_ `sort()` has no worst-case requirement at all; it does have the average-case n log n requirement, and if you really thought you'd run into perverse cases where you're not going to get real-world n log n, you can use `stable_sort()`, which does have an n log n upper bound requirement. – James McNellis May 04 '10 at 22:07
  • @James correct: 25.3.1.1 paragraph 3. The only guarantee is that the average is N*log N. The cplusplus.com website misled me! http://www.cplusplus.com/reference/algorithm/sort/ says the worst case of sort is O(n^2). – wilhelmtell May 04 '10 at 22:12
  • @WilhelmTell of Purple-Magenta: That's because `std::sort` can use something like quicksort, which has O(n^2) worst-case complexity, while `std::stable_sort` is usually implemented using a form of merge sort. – Billy ONeal May 04 '10 at 22:53
  • @WilhelmTell: I've come to the conclusion that cplusplus.com is of pretty poor quality. I've reported a half dozen errors or so over the last few months and none of them have been corrected. – James McNellis May 04 '10 at 23:18
  • This is stupid fast for a simple type with a cheap ==, far faster than any other example (the copying is so cheap). In general this may not be the case for me, but I'll have to keep this in mind for future use! – Hooked May 04 '10 at 23:33
  • @James Seriously? Making a mistake is not nearly as bad as not correcting the mistake! Do they reply at least? – wilhelmtell May 05 '10 at 00:04
  • @WilhelmTell: I got no response at all, which is what irritated me. I don't mind reporting bugs and explaining why they need to be corrected, but if they're not going to fix them... eh. That said, I still use cplusplus.com when I need a quick lookup on something :-P – James McNellis May 05 '10 at 01:23
  • @James truth, same here. Even if Google works by voting with the mouse or the HTML. :s – wilhelmtell May 05 '10 at 01:28
  • The worst case of std::sort is O(n log n), as explained here: https://stackoverflow.com/a/51389544/5649936 – Šimon Hrabec Aug 09 '22 at 09:23

For one thing, you could combine the advantages of both approaches: stop building the set as soon as you discover a duplicate:

template <class T>
bool is_unique(const std::vector<T>& vec)
{
    std::set<T> test;
    for (typename std::vector<T>::const_iterator it = vec.begin(); it != vec.end(); ++it) {
        if (!test.insert(*it).second) { // insert() reports whether the element was new
            return false;
        }
    }
    return true;
}

BTW, Potatoswatter makes a good point that in the generic case you might want to avoid copying T, in which case you might use a std::set<const T*, dereference_less> instead.
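
For illustration, a minimal sketch of that pointer-based variant (my code, not UncleBens'; `is_unique_ptrs` and the comparator layout are illustrative names only):

#include <set>
#include <vector>

struct dereference_less {
    template <class T>
    bool operator()(const T* l, const T* r) const { return *l < *r; }
};

template <class T>
bool is_unique_ptrs(const std::vector<T>& vec) {
    std::set<const T*, dereference_less> seen; // stores pointers, compares pointees
    for (typename std::vector<T>::const_iterator it = vec.begin();
         it != vec.end(); ++it) {
        if (!seen.insert(&*it).second) // an equivalent element is already present
            return false;
    }
    return true;
}

This keeps the early exit while copying only pointers, at the price of the set's per-node allocations.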


You could of course potentially do much better if it weren't generic. E.g., if you had a vector of integers of known range, you could just mark in an array (or even a bitset) whether an element exists.

UncleBens
  • But that's still very expensive because `set` uses dynamic allocations. You're essentially building a `set` when you don't need it. So the solution is correct mathematically, but expensive in practice. – wilhelmtell May 04 '10 at 22:01
  • @WilhelmTell: Yes, but when you start by sorting the vector, you are gonna have to sort it all, which falls under the same worst case scenario as OP's #1. Also, sorting a vector can be expensive, if T is expensive to swap. - This is about finding a middle way between the worst cases of either approach. All in all, it will depend heavily on the types involved and the nature of the data - how often there are or aren't duplicates. – UncleBens May 04 '10 at 22:06
  • Adding all the elements to a vector and sorting it is generally faster than inserting elements and keeping sorted order... – fbrereto May 04 '10 at 22:18
  • I'd have to agree with WilhelmTell, based on testing a worst-case environment where all the elements are already sorted. In this case it seems to work much _slower_ than the first example, when in theory it should be about the same speed. – Hooked May 04 '10 at 22:19
  • Or just create another vector of the same length as the original vector, but made up of T* elements. Then sort those via dereferencing. – dash-tom-bang May 04 '10 at 23:00
  • Or do a `std::sort` on the original container with a custom predicate that acts normal but throws an exception as soon as it compares two equal (not just equivalent, although you could do that too) values. If you catch the exception then you know it's not unique already, otherwise it's unique. – caps Feb 24 '16 at 18:09

Is it infeasible to just use a container that provides this "guarantee" from the get-go? Would it be useful to flag a duplicate at the time of insertion rather than at some point in the future? When I've wanted to do something like this, that's the direction I've gone; just using the set as the "primary" container, and maybe building a parallel vector if I needed to maintain the original order, but of course that makes some assumptions about memory and CPU availability...
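
As a sketch of that direction (my illustration; `UniqueSeq` and its member names are invented): the set rejects duplicates at insertion time, while a parallel vector preserves the original order and random access.

#include <set>
#include <vector>

template <class T>
class UniqueSeq {
    std::set<T> members_;  // does the duplicate rejection
    std::vector<T> order_; // preserves insertion order and random access
public:
    bool push(const T& value) { // returns false if value is a duplicate
        if (!members_.insert(value).second)
            return false;
        order_.push_back(value);
        return true;
    }
    const std::vector<T>& ordered() const { return order_; }
};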

dash-tom-bang

You can use std::unique, but it requires the range to be sorted first:

template <class T>
bool is_unique(vector<T> X) {
  std::sort(X.begin(), X.end());
  return std::unique(X.begin(), X.end()) == X.end();
}

std::unique modifies the sequence and returns an iterator past the end of the deduplicated range, so if that's still the end of the vector, then the elements must all be unique.

This runs in O(n log n), the same as your set example. I don't think you can theoretically guarantee to do it faster, although using a C++0x std::unordered_set instead of std::set would do it in expected linear time - but that requires that your elements be hashable as well as having operator== defined, which might not be so easy.
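
To make that concrete, here is a minimal sketch of the hash-based variant (my code; `is_unique_hashed` is an invented name, and it assumes `std::hash<T>` and `operator==` are usable for your T):

#include <unordered_set>
#include <vector>

template <class T>
bool is_unique_hashed(const std::vector<T>& x) {
    std::unordered_set<T> seen;
    seen.reserve(x.size()); // size the table up front to avoid rehashing
    for (typename std::vector<T>::const_iterator it = x.begin();
         it != x.end(); ++it) {
        if (!seen.insert(*it).second) // duplicate found: stop early
            return false;
    }
    return true;
}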

Also, if you're not modifying the vector in your examples, you'd improve performance by passing it by const reference, so you don't make an unnecessary copy of it.

Peter

If I may add my own 2 cents.

First of all, as @Potatoswatter remarked, unless your elements are cheap to copy (built-in/small PODs) you'll want to use pointers to the original elements rather than copying them.

Second, there are two strategies available.

  1. Simply ensure no duplicate is inserted in the first place. This means, of course, controlling insertion, which is generally achieved by creating a dedicated class (with the vector as an attribute).
  2. Whenever the property is needed, check for duplicates.

I must admit I would lean toward the first. Encapsulation, clear separation of responsibilities and all that.

Anyway, there are a number of ways depending on the requirements. The first question is:

  • do we have to leave the elements in the vector in a particular order, or can we "mess" with them?

If we can mess with them, I would suggest keeping the vector sorted: Loki::AssocVector should get you started (a rough sketch follows). If not, then we need to keep an index on the structure to ensure this property... wait a minute: Boost.MultiIndex to the rescue?
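
As a rough illustration of that sorted-vector idea (my sketch only, in the spirit of Loki::AssocVector but not its actual interface; the class and method names are invented):

#include <algorithm>
#include <cstddef>
#include <vector>

template <class T>
class SortedUniqueVector {
    std::vector<T> data_;
public:
    bool insert(const T& value) { // rejects duplicates, keeps data_ sorted
        typename std::vector<T>::iterator pos =
            std::lower_bound(data_.begin(), data_.end(), value);
        if (pos != data_.end() && !(value < *pos)) // equivalent element already present
            return false;
        data_.insert(pos, value); // O(log N) search, O(N) shift
        return true;
    }
    const T& operator[](std::size_t i) const { return data_[i]; } // random access preserved
    std::size_t size() const { return data_.size(); }
};

Each insert costs O(N) in the worst case because of the shift, but the uniqueness invariant then holds by construction, so checking it later is trivial.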

Thirdly: as you remarked yourself, a simple doubly nested linear search yields O(N^2) complexity on average, which is no good.

If < is already defined, then sorting is obvious, with its O(N log N) complexity. It might also be worth making T hashable, because a std::tr1::unordered_set could yield a better time (I know, you need a RandomAccessIterator, but if T is hashable then it's easy to make T* hashable too ;) )

But in the end, the real issue here is that our advice is necessarily generic because we lack data.

  • What is T? Do you intend the algorithm to be generic?
  • What is the number of elements? 10, 100, 10,000, 1,000,000? Because asymptotic complexity is kind of moot when dealing with a few hundred...
  • And of course: can you ensure uniqueness at insertion time? Can you modify the vector itself?
Matthieu M.

Well, your first one should only take O(N log N), so it clearly has the better worst-case behavior for this application.

However, you should be able to get a better best case if you check as you add things to the set:

template <class T>
bool is_unique3(vector<T> X) {
  set<T> Y;
  typename vector<T>::const_iterator i;
  for(i=X.begin(); i!=X.end(); ++i) {
    if (Y.find(*i) != Y.end()) { // already seen: bail out early
      return false;
    }
    Y.insert(*i);
  }
  return true;
}

This should have O(1) best case, O(N log N) worst case, and an average case that depends on the distribution of the inputs.

clahey

If the type T you store in your vector is large and copying it is costly, consider creating a vector of pointers or iterators to your vector's elements. Sort it based on the elements pointed to, and then check for uniqueness.

You can also use std::set for that. The template looks like this:

template <class Key,class Traits=less<Key>,class Allocator=allocator<Key> > class set

I think you can provide an appropriate Traits parameter and insert raw pointers for speed, or implement a simple wrapper class for the pointers with an < operator.

Don't use the constructor for inserting into the set; use the insert method. The method (one of its overloads) has the signature

pair <iterator, bool> insert(const value_type& _Val);

By checking the result (the second member of the pair), you can often detect a duplicate much more quickly than if you had inserted all the elements.

Maciej Hehl

In the (very) special case of discrete values with a known, not too big, maximum value N, you should be able to start a bucket sort and simply check that the number of values in each bucket is below 2.

bool is_unique(const vector<int>& X, int N)
{
  vector<int> buckets(N,0); // assumes 0 <= X[i] < N for every element
  vector<int>::const_iterator i;
  for(i = X.begin(); i != X.end(); ++i)
    if(++buckets[*i] > 1) // a second occurrence means a duplicate
      return false;
  return true;
}

The complexity of this would be O(n + N): linear in the number of elements, plus the cost of zeroing the buckets.

log0

Using the current C++ standard containers, you have a good solution in your first example. But if you can use a hash container, you might be able to do better, as a hash set does O(1) work per element instead of the O(log n) of a standard set. Of course, everything will depend on the size of n and your particular library implementation.

Mark Ransom
  • A hashmap will give you big-theta of 1 and O(n^2). – wilhelmtell May 04 '10 at 22:07
  • @wilhelmtell: That... does not sound right. Care to share your math? Inserting into a hashmap is supposed to be O(n) amortized. Assuming your hashmap has a way to detect collisions, you should know by the time you inserted the last element whether there's a collision. The only way I can think of to make it O(N^2) is if you assumed the vector checked for collisions on every insert (which I don't think was part of the question) and then only if it threw away the map after each update to the vector. – Jason May 05 '10 at 00:48
  • I have no clue what that means in my comment. That's O(n), and I blame the cat for that typo. – wilhelmtell May 05 '10 at 01:14