19

Possible Duplicate:
Determining if an unordered vector<T> has all unique elements

I have to check a vector for duplicates. Here is my current approach:

I take the first element and compare it against all other elements in the vector. Then I take the next element and do the same, and so on.

Is this the best way to do it, or is there a more efficient way to check for dups?

Ayush
  • Duplicate of [Determining if an unordered vector has all unique elements](http://stackoverflow.com/questions/2769174/determining-if-an-unordered-vectort-has-all-unique-elements) – James McNellis May 18 '10 at 20:16
  • Can you modify the vector? If not, do you have memory to allocate a copy? – florin May 18 '10 at 21:21
  • "...take the next element and do the same..." for what it's worth, when you look at the second element, you can ignore the first element ;-) (And when you're looking at the third element, you can ignore the first two, etc.) – lmat - Reinstate Monica Mar 07 '19 at 16:14

5 Answers

15

If your vector is an STL container (e.g. std::vector), the solution is easy:

std::sort(myvec.begin(), myvec.end());
myvec.erase(std::unique(myvec.begin(), myvec.end()), myvec.end());

According to cppreference (https://en.cppreference.com/w/cpp/algorithm/unique), the elements are shifted around so that the values from myvec.begin() to the return value of std::unique are all unique. The elements after the iterator returned by std::unique are unspecified (useless in every use-case I've seen) so remove them from the std::vector<A> using std::vector<A>::erase.
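
For completeness, here is a minimal, self-contained sketch that turns the sort/unique/erase approach above into a duplicate check; the variable names and output strings are illustrative, not from the original answer:

// Sort, deduplicate, and compare sizes: if the vector shrank, it held duplicates.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    std::vector<int> myvec{3, 1, 4, 1, 5, 9, 2, 6, 5};
    const std::size_t original_size = myvec.size();

    std::sort(myvec.begin(), myvec.end());                // group equal values together
    myvec.erase(std::unique(myvec.begin(), myvec.end()),  // keep one copy of each value...
                myvec.end());                             // ...and drop the unspecified tail

    std::cout << (myvec.size() < original_size ? "duplicates found" : "all unique") << '\n';
}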

Patrick
  • To clarify, the duplicates are not moved to the end of the range; they are just removed from the front of the range. The values of the elements after the new end returned by `std::unique()` are unspecified. If you only want to test whether the range contains no duplicates, `std::adjacent_find()` is more efficient than using `std::unique()`. – James McNellis May 18 '10 at 20:11
  • 1
    You're right. Std::unique puts all the unique elements first, and doesn't specify what happens with the rest of the container. Most important thing however, is to remember that you should use the returned iterator, and not assume that your container only contains unique elements. You have to manually cleanup the tail of the container. – Patrick May 19 '10 at 06:32
13

Use a hash table in which you insert each element. Before you insert an element, check if it's already there. If it is, you have yourself a duplicate. This is O(n) on average, but the worst case is just as bad as your current method.

Alternatively, you can use a set to do the same thing in O(n log n) worst case. This is as good as the sorting solution, except it doesn't change the order of the elements (uses more memory though since you create a set).

Another way is to copy your vector to another vector, sort that, and check the adjacent elements there. I'm not sure if this is faster than the set solution, but I think sorting adds less overhead than the balanced search tree a set uses, so it should be faster in practice.

Of course, if you don't care about keeping the original order of the elements, just sort the initial vector.
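
As a rough sketch of the hash-table idea above (the helper name has_duplicate is mine, not from the answer), std::unordered_set makes the early-exit check a few lines:

#include <unordered_set>
#include <vector>

template <typename T>
bool has_duplicate(const std::vector<T>& v) {
    std::unordered_set<T> seen;
    seen.reserve(v.size());          // avoid rehashing while inserting
    for (const T& x : v) {
        // insert() reports whether the value was actually added
        if (!seen.insert(x).second)
            return true;             // already present: duplicate found, stop early
    }
    return false;                    // every element inserted exactly once: all unique
}

This assumes T has a std::hash specialization and operator==; for types that don't, sorting a copy as described above is the fallback.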

IVlad
  • Not quite "as good" as the sorting solution. It is the same big-O runtime order, but the constant factor for sorting a vector, which is guaranteed to have its elements contiguous in memory, is going to be significantly smaller than for the algorithm using the set. I wouldn't be surprised at all if it was twice as fast. +1 anyway. I think you have the best answer. – A. Levy May 18 '10 at 20:05
  • @A. Levy: true, I mentioned another method. – IVlad May 18 '10 at 20:07
  • A Radix sort can be even faster than O(n log n). http://en.wikipedia.org/wiki/Radix_sort – Mark Ransom May 18 '10 at 20:41
  • @Mark Ransom, radix sort is rarely applicable. – avakar May 18 '10 at 21:40
  • @avakar, true, but rarely is not the same as never. I think because there are no library implementations that it gets forgotten more often than it should. – Mark Ransom May 18 '10 at 22:05
2

If you don't care about an occasional false positive, you can use a Bloom Filter to detect probable duplicates in the collection. If false positives can't be accepted, take the values that fail the filter and run a second detection pass on those. The list of failed values should be fairly small, although they will need to be checked against the full input.
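
For illustration only, here is a toy sketch of how such a filter might be wired up; the class name, bit-array size, number of hashes, and the double-hashing trick are all my own assumptions rather than details from the answer, and a real filter would be sized from the expected element count and acceptable false-positive rate:

#include <bitset>
#include <cstddef>
#include <functional>
#include <vector>

class ToyBloomFilter {
public:
    // Returns true if the value *may* have been seen before (possible duplicate),
    // false if it definitely has not been seen. Also records the value.
    bool test_and_insert(int value) {
        std::size_t h1 = std::hash<int>{}(value);
        std::size_t h2 = h1 * 0x9e3779b97f4a7c15ULL + 1;   // cheap derived second hash
        bool possibly_seen = true;
        for (std::size_t i = 0; i < kNumHashes; ++i) {
            std::size_t pos = (h1 + i * h2) % kNumBits;    // double hashing for k probe positions
            if (!bits_[pos]) possibly_seen = false;
            bits_[pos] = true;
        }
        return possibly_seen;
    }

private:
    static constexpr std::size_t kNumBits = 1 << 16;
    static constexpr std::size_t kNumHashes = 4;
    std::bitset<kNumBits> bits_;
};

// Values flagged here are only *probable* duplicates; they still need the exact
// second pass against the full input that the answer describes.
std::vector<int> probable_duplicates(const std::vector<int>& input) {
    ToyBloomFilter filter;
    std::vector<int> suspects;
    for (int v : input)
        if (filter.test_and_insert(v))
            suspects.push_back(v);
    return suspects;
}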

Mark Ransom
1

Sorting and then comparing adjacent elements is the way to go. A sort takes O(n log n) comparisons, and then an additional n-1 to compare adjacent elements.

The scheme in the question would take (n^2)/2 comparisons.
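
Here is a short sketch of that sort-then-scan check, done on a copy so the caller's ordering is untouched; the function name is mine, not from the answer:

#include <algorithm>
#include <vector>

template <typename T>
bool contains_duplicates(std::vector<T> v) {   // taken by value: we sort a throwaway copy
    std::sort(v.begin(), v.end());
    // After sorting, equal values are adjacent, so one linear scan finds any duplicate pair.
    return std::adjacent_find(v.begin(), v.end()) != v.end();
}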

KeithB
1

You can also use std::binary_search.

Here are two good examples that will help you:

http://www.cplusplus.com/reference/algorithm/binary_search/

http://www.cplusplus.com/reference/algorithm/unique_copy/
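
To tie the two references together, here is a hedged sketch using std::unique_copy on a sorted copy: the output is shorter than the input exactly when duplicates exist. The function name is mine, and note that std::binary_search would likewise require the sorted copy:

#include <algorithm>
#include <iterator>
#include <vector>

template <typename T>
bool has_duplicates_via_unique_copy(std::vector<T> v) {   // by value: sort a copy
    std::sort(v.begin(), v.end());                        // both algorithms require sorted data
    std::vector<T> uniq;
    std::unique_copy(v.begin(), v.end(), std::back_inserter(uniq));  // keeps one element per run
    return uniq.size() != v.size();
}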

LoudNPossiblyWrong