1

I have a very large std::vector of std::vectors containing a fixed number of unsigned integers.

All inner vectors of uints are sorted in ascending order.

My current way to eliminate duplicate vectors is:

unsigned int i = 0;
while ( i < new_combs.size() )
{
  unsigned int j = i + 1;
  while ( j < new_combs.size() )
  {
     unsigned int k = 0;
     while ( k < new_combs.at(i).size() && new_combs.at(i).at(k) == new_combs.at(j).at(k) )
        ++k;
     if ( k == new_combs.at(j).size() )
        new_combs.erase(new_combs.begin() + j);
     else
        ++j;
  }
  ++i;
}

Here, new_combs is a vector containing vectors as mentioned above.

Is there a more efficient way to eliminate duplicates if the vector of vectors is unsorted?

Niklas Hansson
stefan
  • Duplicate of [Most efficient way to erase duplicates and sort a c++ vector?](http://stackoverflow.com/questions/1041620/most-efficient-way-to-erase-duplicates-and-sort-a-c-vector) – Anonymous Apr 10 '12 at 12:19

6 Answers

9

A shorter way would be to use the algorithms from <algorithm>:

std::sort(new_combs.begin(), new_combs.end());
new_combs.erase(std::unique(new_combs.begin(), new_combs.end()), new_combs.end());

Unless you specifically need a std::vector, you can use std::set to avoid duplicates.
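
For completeness, a minimal self-contained sketch of the sort-and-unique approach (the wrapper function name is made up for illustration; std::vector already provides the lexicographic operator< and element-wise operator== that std::sort and std::unique rely on):

#include <algorithm>
#include <vector>

// Sketch: deduplicate a vector of (sorted) vectors of unsigned ints.
void remove_duplicates(std::vector<std::vector<unsigned int> >& new_combs)
{
    std::sort(new_combs.begin(), new_combs.end());
    new_combs.erase(std::unique(new_combs.begin(), new_combs.end()),
                    new_combs.end());
}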

Luchian Grigore
3

Have you considered using std::set? It is ordered and doesn't allow duplicates to begin with.
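
A rough sketch of what that could look like (illustrative only; the helper name is invented, and the copy back into the original vector is shown only in case a std::vector is still needed afterwards):

#include <set>
#include <vector>

// Sketch: route the data through a std::set, which keeps each key at most
// once (ordered by std::vector's lexicographic operator<).
void remove_duplicates_via_set(std::vector<std::vector<unsigned int> >& new_combs)
{
    std::set<std::vector<unsigned int> > unique_combs(new_combs.begin(),
                                                      new_combs.end());
    new_combs.assign(unique_combs.begin(), unique_combs.end());
}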

xpapad
2

There is not much you can do if the outer vector is unsorted. If it is sorted, however, you can use std::unique from <algorithm>:

new_combs.erase(std::unique(new_combs.begin(), new_combs.end()), new_combs.end());
Ivaylo Strandjev
0

There are several elements in your code that ring alarm bells regarding performance.

First, you are using vectors. Erasing elements from vectors is always slow. You might consider using a different container (std::list) or adjust your code so that you have a special value for nothing (e.g. zero or -1).

Second, you could use an std::set or std::unordered_set to keep track of the values you have already encountered. That way, you only have to loop over your vectors once.

EDIT: Forget this answer. I misread the question and thought that duplicate values (not duplicate vectors) had to be removed.

Nevertheless, some reactions to the comments given below:

  • @Jerry: I agree that vectors are in most cases faster than lists, but only if the size of the vector is limited. If the vector contains 1 million elements and you need to remove the 3rd, then the 5th, then the 10th, ... you end up moving lots of elements. In such a case a list could be faster.
  • @James: in the original question, the elements were not removed from the end of the vector, but in the middle. If the vector is very big (let's say 1 million elements), then removing elements can still become a bottleneck. However, I agree that using a sort followed by a unique is probably faster.
Patrick
  • I don't understand what you mean by "have a special value for nothing", my vectors are all of equal size and the size is always greater than 0 – stefan Apr 10 '12 at 12:23
  • Have you ever actually compared vectors to lists? You might want to read a [previous question](http://stackoverflow.com/q/9764452/179910), including both the answers and the link. Bottom line: even when it should theoretically be superior, list is usually slower than vector. – Jerry Coffin Apr 10 '12 at 14:10
  • Erasing elements from the end of a vector is **not** slow. Using the solutions with a sorted vector and `std::unique` will probably be considerably faster than using any form of map (which are node based containers, and notoriously slow). – James Kanze Apr 10 '12 at 14:54
0

Asymptotically, your algorithm looks like the usual O(n) implementation and is therefore optimal. (Although I did not understand your diagonalization strategy with i and j, or why you only ever erase and never move elements; the code is quite unclear.) However, you are duplicating the STL, and a shorter version of a unique-ing loop is:

struct unique {
    template <class C>
    void operator()( C& c ) {
         // erase consecutive duplicate elements within one container
         c.erase( std::unique( c.begin(), c.end() ), c.end() );
    }
};

std::for_each( new_combs.begin(), new_combs.end(), unique() );
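
With a C++11 compiler (an assumption; the question does not say which standard is available), the helper struct can be replaced by a lambda:

std::for_each( new_combs.begin(), new_combs.end(),
               []( std::vector<unsigned int>& c ) {
                   // remove consecutive duplicate values inside one inner vector
                   c.erase( std::unique( c.begin(), c.end() ), c.end() );
               } );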
thiton
0

I agree with Luchian Grigore's answer, but you might also consider converting the whole outer vector into an std::unordered_set, which is an O(n) operation provided the hashes of the sub-vectors are not too lopsided (as opposed to the average O(n*log(n)) for sorting). You can even store pointers to the sub-vectors in your unordered_set to avoid unnecessary copying. This can make an important performance difference for large amounts of data.

This example illustrates the basic idea of using your own hash function and pointers (it deals with vector of strings and uses unordered_map, not unordered_set, but you should be able to modify it fairly easily to your needs).
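
A rough sketch of the one-pass idea (all names are illustrative, the hash combiner is just the common boost-style formula, and the pointer optimisation mentioned above is left out for brevity):

#include <cstddef>
#include <unordered_set>
#include <vector>

// Illustrative hash for a vector of unsigned ints.
struct vec_hash {
    std::size_t operator()(const std::vector<unsigned int>& v) const {
        std::size_t seed = v.size();
        for (unsigned int x : v)
            seed ^= x + 0x9e3779b9 + (seed << 6) + (seed >> 2);
        return seed;
    }
};

// Sketch: one pass over new_combs, keeping only the first occurrence of
// each combination; average O(n*m) for n vectors of length m.
void remove_duplicates_hashed(std::vector<std::vector<unsigned int> >& new_combs)
{
    std::unordered_set<std::vector<unsigned int>, vec_hash> seen;
    std::vector<std::vector<unsigned int> > result;
    result.reserve(new_combs.size());
    for (const std::vector<unsigned int>& v : new_combs)
        if (seen.insert(v).second)   // true only for a not-yet-seen vector
            result.push_back(v);
    new_combs.swap(result);
}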

Branko Dimitrijevic