1

I have a very large std::vector of std::vectors containing a fixed number of unsigned integers.

All inner vectors of uints are sorted in ascending order.

My current way to eliminate duplicate vectors is:

unsigned int i = 0;
while ( i < new_combs.size() )
{
  unsigned int j = i + 1;
  while ( j < new_combs.size() )
  {
     unsigned int k = 0;
     while ( k < new_combs.at(i).size() && new_combs.at(i).at(k) == new_combs.at(j).at(k) )
        ++k;
     if ( k == new_combs.at(j).size() )
        new_combs.erase(new_combs.begin() + j);
     else
        ++j;
  }
  ++i;
}

Here, new_combs is a vector containing vectors as mentioned above.

Is there a more efficient way to eliminate duplicates if the vector of vectors is unsorted?

Niklas Hansson
stefan
  • Duplicate of [Most efficient way to erase duplicates and sort a c++ vector?](http://stackoverflow.com/questions/1041620/most-efficient-way-to-erase-duplicates-and-sort-a-c-vector) – Anonymous Apr 10 '12 at 12:19

6 Answers

9

A shorter way would be to use the algorithms from <algorithm>:

std::sort(new_combs.begin(), new_combs.end());
new_combs.erase(std::unique(new_combs.begin(), new_combs.end()), new_combs.end());

Unless you specifically need a std::vector, you can use std::set to avoid duplicates.
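
For completeness, a minimal self-contained sketch of the sort-and-unique approach (the wrapper function name is made up for illustration; std::vector already provides the lexicographic operator< and element-wise operator== that std::sort and std::unique rely on):

#include <algorithm>
#include <vector>

// Sketch: deduplicate a vector of (sorted) vectors of unsigned ints.
void remove_duplicates(std::vector<std::vector<unsigned int> >& new_combs)
{
    std::sort(new_combs.begin(), new_combs.end());
    new_combs.erase(std::unique(new_combs.begin(), new_combs.end()),
                    new_combs.end());
}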

Luchian Grigore
3

Have you considered using std::set? It is ordered and doesn't allow duplicates to begin with.
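
A rough sketch of what that could look like (illustrative only; the helper name is invented, and the copy back into the original vector is shown only in case a std::vector is still needed afterwards):

#include <set>
#include <vector>

// Sketch: route the data through a std::set, which keeps each key at most
// once (ordered by std::vector's lexicographic operator<).
void remove_duplicates_via_set(std::vector<std::vector<unsigned int> >& new_combs)
{
    std::set<std::vector<unsigned int> > unique_combs(new_combs.begin(),
                                                      new_combs.end());
    new_combs.assign(unique_combs.begin(), unique_combs.end());
}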

xpapad
2

There is not much you can do if the outer vector is unsorted. If it is sorted, however, you can use std::unique from <algorithm>:

new_combs.erase(std::unique(new_combs.begin(), new_combs.end()), new_combs.end());
Ivaylo Strandjev
0

There are several elements in your code that ring alarm bells regarding performance.

First, you are using vectors. Erasing elements from vectors is always slow. You might consider using a different container (std::list) or adjust your code so that you have a special value for nothing (e.g. zero or -1).

Second, you could use an std::set or std::unordered_set to keep track of the values you have already encountered. That way, you only have to loop over your vectors once.

EDIT: Forget this answer. I misread the question and thought that duplicate values (not duplicate vectors) had to be removed.

Nevertheless, some reactions to the comments given below:

  • @Jerry: I agree that vectors are in most cases faster than lists, but only if the size of the vector is limited. If the vector contains 1 million elements and you need to remove the 3rd, then the 5th, then the 10th, ... you end up moving lots of elements. In such a case a list could be faster.
  • @James: in the original question, the elements were not removed from the end of the vector, but in the middle. If the vector is very big (let's say 1 million elements), then removing elements can still become a bottleneck. However, I agree that using a sort followed by a unique is probably faster.
Patrick
  • I don't understand what you mean by "have a special value for nothing", my vectors are all of equal size and the size is always greater than 0 – stefan Apr 10 '12 at 12:23
  • Have you ever actually compared vectors to lists? You might want to read a [previous question](http://stackoverflow.com/q/9764452/179910), including both the answers and the link. Bottom line: even when it should theoretically be superior, list is usually slower than vector. – Jerry Coffin Apr 10 '12 at 14:10
  • Erasing elements from the end of a vector is **not** slow. Using the solutions with a sorted vector and `std::unique` will probably be considerably faster than using any form of map (which are node based containers, and notoriously slow). – James Kanze Apr 10 '12 at 14:54
0

Asymptotically, your algorithm looks like the usual O(n) implementation and is therefore optimal. (Although I did not understand your diagonalization strategy with i and j, or why you only ever erase and never move elements; the code is quite unclear.) However, you are duplicating the STL, and a shorter version of a unique-ing loop is:

struct unique {
    template <class C>
    void operator()( C& c ) {
         // erase consecutive duplicate elements within one container
         c.erase( std::unique( c.begin(), c.end() ), c.end() );
    }
};

std::for_each( new_combs.begin(), new_combs.end(), unique() );
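
With a C++11 compiler (an assumption; the question does not say which standard is available), the helper struct can be replaced by a lambda:

std::for_each( new_combs.begin(), new_combs.end(),
               []( std::vector<unsigned int>& c ) {
                   // remove consecutive duplicate values inside one inner vector
                   c.erase( std::unique( c.begin(), c.end() ), c.end() );
               } );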
thiton
0

I agree with Luchian Grigore's answer, but you might also consider converting the whole outer vector into an std::unordered_set, which is an O(n) operation provided the hashes of the sub-vectors are not too lopsided (as opposed to the average O(n*log(n)) for sorting). You can even store pointers to the sub-vectors in your unordered_set to avoid unnecessary copying. This can make an important performance difference for large amounts of data.

This example illustrates the basic idea of using your own hash function and pointers (it deals with vector of strings and uses unordered_map, not unordered_set, but you should be able to modify it fairly easily to your needs).
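
A rough sketch of the one-pass idea (all names are illustrative, the hash combiner is just the common boost-style formula, and the pointer optimisation mentioned above is left out for brevity):

#include <cstddef>
#include <unordered_set>
#include <vector>

// Illustrative hash for a vector of unsigned ints.
struct vec_hash {
    std::size_t operator()(const std::vector<unsigned int>& v) const {
        std::size_t seed = v.size();
        for (unsigned int x : v)
            seed ^= x + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// Sketch: one pass over new_combs, keeping only the first occurrence of
// each combination; average O(n*m) for n vectors of length m.
void remove_duplicates_hashed(std::vector<std::vector<unsigned int> >& new_combs)
{
    std::unordered_set<std::vector<unsigned int>, vec_hash> seen;
    std::vector<std::vector<unsigned int> > result;
    result.reserve(new_combs.size());
    for (const std::vector<unsigned int>& v : new_combs)
        if (seen.insert(v).second)   // true only for a not-yet-seen vector
            result.push_back(v);
    new_combs.swap(result);
}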

Branko Dimitrijevic