24

Say you have a collection of elements. How can you pick out the ones that have duplicates and put each set of duplicates into its own group, with the least number of comparisons? Preferably in C++, but the algorithm matters more than the language. For example, given {E1,E2,E3,E4,E4,E2,E6,E4,E3}, I wish to extract {E2,E2}, {E3,E3}, {E4,E4,E4}. Which data structure and algorithm would you choose? Please also include the cost of setting up the data structure, say, if it's a pre-sorted one like std::multimap.

Updates

To make things clearer, as suggested: there is one constraint. The elements must be compared against each other directly to be certain they are duplicates.

So hashes do not apply, because they merely shift the comparisons from heavy elements (e.g. chunks of data) to light elements (integers). They reduce some comparisons, but do not do away with them; in the end, we are back at our original problem once we are inside one collision bucket.

Pretend you have a bunch of potentially duplicate files of GBs each, and they bear the same hash value under every hash algorithm human beings know. Now you are going to spot the real duplicates.

No, it can't be a real-life problem (even MD5 is enough to generate unique hashes for real-life files). But just pretend it is, so that we can focus on finding the data structure + algorithm that involves the least number of comparisons.
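To make the point concrete, here is a minimal sketch (Element, cheap_hash, and expensive_equal are hypothetical stand-ins, not my actual types) showing why hashing only narrows the problem: every collision bucket still has to be resolved with the expensive predicate.

#include <cstddef>
#include <unordered_map>
#include <vector>

struct Element { int payload; };                      // stand-in for GBs of data
std::size_t cheap_hash(const Element&) { return 0; }  // pretend: everything collides
bool expensive_equal(const Element& a, const Element& b)
{ return a.payload == b.payload; }                    // imagine seconds per call

int main()
{
    std::vector<Element> elems = { {1}, {2}, {1} };
    std::unordered_map<std::size_t, std::vector<Element>> buckets;
    for (const Element& e : elems)
        buckets[cheap_hash(e)].push_back(e);
    // Every bucket holding more than one element must still be resolved
    // with pairwise expensive_equal() calls -- the original problem again.
}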


What I am doing is to

  1. represent the elements in an STL std::list data structure (in that 1) its element deletion is cheaper than, say, a vector's, and 2) its insertion is cheaper, not requiring a sort),

  2. pop out one element and compare it against the rest; if a duplicate is found, it is pulled out of the list. Once the end of the list is reached, one group of duplicates has been found, if any,

  3. repeat the above 2 steps until the list is empty.

It needs N-1 comparisons in the best case, but N(N-1)/2 (i.e. 1+2+...+(N-1)) in the worst case.

what are the better alternatives?


My code using method explained above:

// algorithm to consume the std::list container,
// supports: list<path_type>, list<pair<std::string, paths_type::const_iterator>>
template<class T>
struct consume_list
{
    groups_type operator()(list<T>& l)
    {
        // remove spurious identicals and group the rest
        // algorithm:  
        // 1. compare the first element with the remaining elements, 
        //    pick out all duplicated files including the first element itself.
        // 2. start over again with the shrunken list
        //     until the list contains one or zero elements.

        groups_type sub_groups;           
        group_type one_group; 
        one_group.reserve(1024);

        while(l.size() > 1)
        {
            T front(l.front());
            l.pop_front();

            item_predicate<T> ep(front);
            typename list<T>::iterator it     = l.begin(); 
            typename list<T>::iterator it_end = l.end();
            while(it != it_end)
            {
                if(ep.equals(*it))
                {
                    one_group.push_back(ep.extract_path(*(it))); // single it out
                    it = l.erase(it);
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(ep.extract_path(front));                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
        return sub_groups;
    }        
}; 


// type for item-item comparison within a stl container, e.g.  std::list 
template <class T>
struct item_predicate{};

// specialization for type path_type      
template <>
struct item_predicate<path_type>
{
public:
    item_predicate(const path_type& base)/*init list*/            
    {}
public:
    bool equals(const path_type& comparee)
    {
        bool result = false;
        /* time-consuming operations here*/
        return result;
    }

    const path_type& extract_path(const path_type& p)
    {
        return p;
    }
private:
    // class members
}; 



Thanks for the answers below; however, they seem to have been misled by my example into thinking it is about integers. In fact the elements are type-agnostic (not necessarily integers, strings or any other PODs), and the equality predicates are self-defined, that is, the comparison can be very heavy.

So maybe my question should be: which data structure + algorithm involves the fewest comparisons?

Using a pre-sorted container like std::multiset or std::multimap is not better according to my test, since

  1. the sorting done while inserting already performs comparisons,
  2. the subsequent adjacent finding performs comparisons again,
  3. these data structures prefer less-than operations to equality operations: they perform two less-than comparisons (a<b and b<a) to decide that two elements are equivalent, as illustrated below.

I do not see how it can save comparisons.
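For reference, the equivalence test an ordered container performs internally looks like the following generic sketch (my illustration, not code from my program); this is where the two less-than calls come from:

template <class T>
bool equivalent(const T& a, const T& b)
{
    // ordered containers know only operator<; two keys are equivalent
    // when neither orders before the other: two '<' calls per '==' answer
    return !(a < b) && !(b < a);
}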


One more thing that's been ignored by some answers below: I need to differentiate the duplicate groups from one another, not just keep them in the container.


Conclusion

After all the discussion, there seem to be three ways:

  1. my original naive method as explained above,
  2. start with a linear container like std::vector, sort it, and then locate the equal ranges,
  3. start with an associative container like std::map<Type, vector<duplicates>> and pick out the duplicates during the setup of the container, as suggested by Charles Bailey.

I've coded a sample to test all the methods, as posted below.

The number of duplicates and how they are distributed may influence the best choice:

  • Method 1 is best when the duplicates fall heavily at the front, and worst when they fall at the end. Sorting will not change the distribution, only its direction.
  • Method 3 has the best average performance
  • Method 2 is never the best choice

Thanks to all who participated in the discussion.

One output with 20 sample items from the code below.

Test with [ 20 10 6 5 4 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 ]

and [ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 3 4 5 6 10 20 ] respectively

using std::vector -> sort() -> adjacent_find():

comparisons: [ '<' = 139, '==' = 23 ]

comparisons: [ '<' = 38, '==' = 23 ]

using std::list -> sort() -> shrink list:

comparisons: [ '<' = 50, '==' = 43 ]

comparisons: [ '<' = 52, '==' = 43 ]

using std::list -> shrink list:

comparisons: [ '<' = 0, '==' = 121 ]

comparisons: [ '<' = 0, '==' = 43 ]

using std::vector -> std::map<Type, vector<Type>>:

comparisons: [ '<' = 79, '==' = 0 ]

comparisons: [ '<' = 53, '==' = 0 ]

using std::vector -> std::multiset -> adjacent_find():

comparisons: [ '<' = 79, '==' = 7 ]

comparisons: [ '<' = 53, '==' = 7 ]

Code

// compile with VC++10: cl.exe /EHsc

#include <vector>
#include <deque>
#include <list>
#include <map>
#include <set>
#include <algorithm>
#include <iostream>
#include <sstream>

#include <boost/foreach.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/format.hpp>

using namespace std;

struct Type
{
    Type(int i) : m_i(i){}

    bool operator<(const Type& t) const
    {
        ++number_less_than_comparison;
        return m_i < t.m_i;
    }

    bool operator==(const Type& t) const
    {
        ++number_equal_comparison;    
        return m_i == t.m_i;
    }
public:
    static void log(const string& operation)
    {
        cout 
        << "comparison during " <<operation << ": [ "
        << "'<'  = " << number_less_than_comparison
        << ", "
        << "'==' = " << number_equal_comparison
        << " ]\n";

        reset();
    }

    int to_int() const
    {
        return m_i;
    }
private:
    static void reset()
    {
        number_less_than_comparison = 0;
        number_equal_comparison = 0;      
    }

public:
    static size_t number_less_than_comparison;
    static size_t number_equal_comparison;    
private:
    int m_i;
};

size_t Type::number_less_than_comparison = 0;
size_t Type::number_equal_comparison = 0;  

ostream& operator<<(ostream& os, const Type& t) 
{
    os << t.to_int();
    return os;
}

template< class Container >
struct Test
{    
    void recursive_run(size_t n)
    { 

        for(size_t i = 48; i < n; ++i)
        {
            run(i);
        }    
    }

    void run(size_t i)
    {
        cout << 
        boost::format("\n\nTest %1% sample elements\nusing method %2%:\n") 
        % i 
        % Description();

        generate_sample(i);
        sort();
        locate();   

        generate_reverse_sample(i);
        sort();
        locate(); 
    }

private:    
    void print_me(const string& when)
    {
        std::stringstream ss;
        ss << when <<" = [ ";
        BOOST_FOREACH(const typename Container::value_type& v, m_container)
        {
            ss << v << " ";
        }
        ss << "]\n";    
        cout << ss.str();
    }

    void generate_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 1; i <= n; ++i)
        {
            m_container.push_back(Type(n/i));    
        }
        print_me("init value");
        Type::log("setup");
    }

    void generate_reverse_sample(size_t n)
    {
        m_container.clear();
        for(size_t i = 0; i < n; ++i)
        {
            m_container.push_back(Type(n/(n-i)));     
        }
        print_me("init value(reverse order)");
        Type::log("setup");
    }    

    void sort()
    {    
        sort_it();

        Type::log("sort");
        print_me("after sort");

    }

    void locate()
    {
        locate_duplicates();

        Type::log("locate duplicate");
    }
protected:
    virtual string Description() = 0;
    virtual void sort_it() = 0;
    virtual void locate_duplicates() = 0;
protected:
    Container m_container;    
};

struct Vector : Test<vector<Type> >
{    
    string Description()
    {
        return "std::vector<Type> -> sort() -> adjacent_find()";
    } 

private:           
    void sort_it()
    {    
        std::sort(m_container.begin(), m_container.end()); 
    }

    void locate_duplicates()
    {
        using std::adjacent_find;
        typedef vector<Type>::iterator ITR;
        typedef vector<Type>::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        ITR itr_begin(m_container.begin());
        ITR itr_end(m_container.end());       
        ITR itr(m_container.begin()); 
        ITR itr_range_begin(m_container.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
                itr_begin, 
                itr_end, 
                [](const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            itr_range_begin = itr; // the equal range starts here

            // find the end of the equal range
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                ++itr;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for the next iteration
            itr_begin = itr;
        }  
    }
};

struct List : Test<list<Type> >
{
    List(bool sorted) : m_sorted(sorted){}

    string Description()
    {
        return m_sorted ? "std::list -> sort() -> shrink list" : "std::list -> shrink list";
    }
private:    
    void sort_it()
    {
        if(m_sorted) m_container.sort(); // std::list has no random-access iterators; use its member sort()
    }

    void locate_duplicates()
    {       
        typedef list<Type>::value_type VALUE;
        typedef list<Type>::iterator ITR;

        typedef vector<VALUE>  GROUP;
        typedef vector<GROUP>  GROUPS;

        GROUPS sub_groups;
        GROUP one_group; 

        while(m_container.size() > 1)
        {
            VALUE front(m_container.front());
            m_container.pop_front();

            ITR it     = m_container.begin(); 
            ITR it_end = m_container.end();
            while(it != it_end)
            {
                if(front == (*it))
                {
                    one_group.push_back(*it); // single it out
                    it = m_container.erase(it); // shrink list by one
                }
                else
                {
                    it++;
                }
            }

            // save results
            if(!one_group.empty())
            {
                // save
                one_group.push_back(front);                    
                sub_groups.push_back(one_group);

                // clear, memory allocation not freed
                one_group.clear(); 
            }            
        }
    }        

private:
    bool m_sorted;
};

struct Map : Test<vector<Type>>
{    
    string Description()
    {
        return "std::vector -> std::map<Type, vector<Type>>" ;
    }
private:    
    void sort_it() {}

    void locate_duplicates()
    {
        typedef map<Type, vector<Type> > MAP;
        typedef MAP::iterator ITR;

        MAP local_map;

        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            pair<ITR, bool> mit; 
            mit = local_map.insert(make_pair(v, vector<Type>(1, v)));   
            if(!mit.second) (mit.first->second).push_back(v); 
         }

        // remove singleton groups: an element without duplicates is not wanted
        ITR itr(local_map.begin());
        while(itr != local_map.end())
        {
            if(itr->second.size() < 2)
                itr = local_map.erase(itr); // VC++10 map::erase returns the next iterator
            else
                ++itr;
        }
    }        
};

struct Multiset :  Test<vector<Type>>
{
    string Description()
    {
        return "std::vector -> std::multiset<Type> -> adjacent_find()" ;
    }
private:
    void sort_it() {}

    void locate_duplicates()
    {   
        using std::adjacent_find;

        typedef multiset<Type> SET; // must be multiset: a plain set would drop the duplicates
        typedef SET::iterator ITR;
        typedef SET::value_type  VALUE;

        typedef boost::tuple<VALUE, ITR, ITR> TUPLE;
        typedef vector<TUPLE> V_TUPLE;

        V_TUPLE results;

        SET local_set;
        BOOST_FOREACH(const vector<Type>::value_type& v, m_container)
        {
            local_set.insert(v);
        }

        ITR itr_begin(local_set.begin());
        ITR itr_end(local_set.end());       
        ITR itr(local_set.begin()); 
        ITR itr_range_begin(local_set.begin());  

        while(itr_begin != itr_end)
        {     
            // find the start of one equal range
            itr = adjacent_find(
                itr_begin, 
                itr_end, 
                [](const VALUE& v1, const VALUE& v2)
                {
                    return v1 == v2;
                }
            );
            if(itr_end == itr) break; // end of container

            itr_range_begin = itr; // the equal range starts here

            // find the end of the equal range
            VALUE start = *itr; 
            while(itr != itr_end)
            {
                if(!(*itr == start)) break;                
                ++itr;
            }

            results.push_back(TUPLE(start, itr_range_begin, itr));

            // prepare for the next iteration
            itr_begin = itr;
        }  
    } 
};

int main()
{
    size_t N = 20;

    Vector().run(N);
    List(true).run(N);
    List(false).run(N);
    Map().run(N);
    Multiset().run(N);
}
t.g.
  • I think you need to describe the purpose of your algorithm a bit better. Given that the objective is to end with an empty list, wouldn't throwing all in the input away be the fastest and simplest solution? – CB Bailey Aug 26 '09 at 06:29
  • the input is a collection that contains possible duplicate elements, and the outputs are zero or more collections that each contain duplicates of the same kind. (I prefer not to use the term "list", because it may imply some implementation, e.g. std::list) – t.g. Aug 26 '09 at 06:38
  • In that case, why isn't a pre-sorted container (e.g. multiset) a solution. i.e. post-edit, why do you imply that you need to do "adjacent finding". Surely, once the container is populated, you just output it in order, no comparisons required? – CB Bailey Aug 26 '09 at 06:45
  • adjacent finding is needed because there may be multiple duplicate groups. For example, in a pre-sorted container like this {E1, E2, E2, E4, E4, E5}, I need to single out {E2, E2} and {E4, E4}. Without adjacent finding, I don't know where E2 and E4 start and end. – t.g. Aug 26 '09 at 07:04
  • What's the required output interface, then? Do you need to output a structured collection object or are you outputting a value for '{', ',' and '}'? Are 'E2' and 'E2' identical, or merely equivalent? – CB Bailey Aug 26 '09 at 07:14
  • 1
    equivalent. and to check that they are equivalent are expensive operations that may take a dozen of seconds. the output interface doesn't matter as long as the information is extracted out. In my std::multimap version of implementation, I just saved the std::multimap::iterators representing the range. Because with std::multimap involves comparisons at insertion and adjacent finding. I switched to a non-sorted container std::list in hope to save comparisons. But no significant improvement, nor decrease. A container bounded with an algorithm with least comparison is what I need. – t.g. Aug 26 '09 at 07:29
  • The output format is irrelevant too. I put "'{', ',' and '}'" only to explain things in a way that everyone understands easily – t.g. Aug 26 '09 at 07:33
  • I don't think that the output format is irrelevant. I've just described an output format that is a flat iteration through the entire collection that signals the change in group by outputting an element that does not `==` the previous one. This doesn't seem to match what you want so you need to explain what's supposed to happen to these groups next. At the moment I think that either a multiset, or a map of element to a list of other matching elements would meet your requirements and not need a second comparison step, but your requirements aren't clear enough to be certain. – CB Bailey Aug 26 '09 at 08:21
  • with a multiset, you have to do comparisons in order to sort while constructing the multiset: this is the 1st round of comparisons. And before the set is completely constructed, i.e. all elements are inserted, you cannot identify the ranges of duplicates; you have to wait until all the insertions are done, and only then iterate the set through (hence a second round of comparisons) to identify the ranges. With a list, no comparison is needed at insertion, so comparisons happen only at iteration. But I am not sure whether the latter is the best one, and that's the point of my question. – t.g. Aug 26 '09 at 08:38
  • I think I understand your concern so I've posted an answer. Finding the ranges in a multiset will only involve n-1 comparisons, though, which (even including the more expensive insert) in comparison to removing equivalent elements, one group at a time via a linear search through a list may not be any more expensive for many inputs. – CB Bailey Aug 26 '09 at 08:51
  • OK, I still haven't understood the question. Please can you update with some sample code that implements a dumb algorithm that produces the output that you need and highlight which operations are expensive. My current understanding is that you have a cheap weaker order and an expensive stronger order, that you want to generate groups that are equivalent under the weaker order, but with duplicates under the stronger eliminated but this is inferred from a lot of separate comments and it would be good to have the problem stated more precisely and have some target comparison counts to shoot at. – CB Bailey Aug 27 '09 at 09:17
  • It won't save you comparisons, but you should almost certainly use a vector instead of a list. Cache misses are painful, and the erase/remove idiom takes care of deletions efficiently. And I don't see why you'd need a lot of insertions once the list is sorted. – jalf Aug 27 '09 at 09:46
  • @Charles Bailey, code appended. @jalf, vectors are no good, because I want to shrink the container as I go so that I have fewer and fewer comparisons. With vectors 1) deletion in the middle is inefficient, 2) and worse, iterators are invalidated by deletion inside a loop – t.g. Aug 27 '09 at 10:40
  • @t.g. Why do you need to shrink the vector to avoid comparisons? Just use a pair of iterators to denote the subrange you need to compare. Instead of deleting elements, swap them to the end, and exclude them from that subrange. – jalf Aug 27 '09 at 13:33
  • @jalf, because the elements are random. If there is a sub-range, it happens to be [begin(),end()). – t.g. Aug 27 '09 at 14:19

11 Answers

14

Yes, you can do much better.

  1. Sort them (O(n) for simple integers, O(n*log n) in general); then duplicates are guaranteed to be adjacent, making them quick to find: O(n)

  2. Use a hash table, also O(n). For each item: (a) check if it's already in the hash table; if so, it's a duplicate; (b) if not, put it in the hash table (see the sketch below).
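A minimal sketch of option 2, assuming the element type is cheaply hashable (the names and the int stand-in are illustrative only):

#include <unordered_set>
#include <vector>

std::vector<int> find_duplicates(const std::vector<int>& input)
{
    std::unordered_set<int> seen;
    std::vector<int> dups;
    for (int v : input)
        if (!seen.insert(v).second)  // insert fails => already seen => duplicate
            dups.push_back(v);
    return dups;                     // expected O(n) overall
}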

edit

The method you're using seems to do O(N^2) comparisons:

for i = 0; i < length; ++i           // will do length times
    for j = i+1; j < length; ++j     // will do length-i times
        compare

So for length 5 you do 4+3+2+1=10 compares; for 6 you do 15, etc. (N^2)/2 - N/2 to be exact. N*log(N) is smaller, for any reasonably high value of N.

How big is N in your case?

As far as reducing hash collisions goes, the best way is to get a better hash function :-D. Assuming that is not possible, if you can make a variant (e.g., a different modulus), you may be able to do a nested hash.

derobert
  • my problem is the hashes are already equal. Suppose we are already in a collision bucket and we are going to group these elements. What's the best way to do it if the element equality predicate is so expensive that I want to save every single call? – t.g. Aug 26 '09 at 06:15
  • Thanks for the elaboration. In the worst case (when no two elements are equal), (N^2)/2 - N/2 comparisons are expected; but in the best case (when all are equal), only N-1 are needed, because each duplicate is pulled out right after its comparison. Keeping the iterators valid after erasing elements during a loop is another factor I considered when choosing std::list over std::vector. – t.g. Aug 27 '09 at 02:08
7

1. Sort the array, O(n log n) at worst: mergesort/heapsort/binary tree sort, etc.

2. Compare neighbors and pull the matches out, O(n), as sketched below.
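A sketch of those two steps on a vector (int used as a stand-in for the real element type):

#include <algorithm>
#include <cstddef>
#include <vector>

void find_duplicate_runs(std::vector<int>& v)
{
    std::sort(v.begin(), v.end());              // O(n log n)
    for (std::size_t i = 1; i < v.size(); ++i)  // O(n) sweep over neighbors
        if (v[i] == v[i - 1])
        { /* v[i] belongs to the same duplicate group as v[i-1] */ }
}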

Nathan
  • that's what I did at first, using STL std::multimap and std::adjacent_find; not significantly better, though. – t.g. Aug 26 '09 at 06:07
  • The multimap will not give you the performance you want. It is faster to push all the elements into a vector and sort them when done than to keep a sorted multimap, which has to do tree balancing to keep things sorted. – fbrereto Aug 26 '09 at 17:02
  • @fbrereton, thanks for this info, I'll test that. Probably I was intrigued too much by the easy-to-get multimap::equal_range() interface before thinking further. – t.g. Aug 27 '09 at 02:11
  • There is an equal_range algorithm that works on all STL containers: http://www.cplusplus.com/reference/algorithm/equal_range/ – fbrereto Aug 27 '09 at 02:59
5

Keep a hash-table-based structure mapping value to count; if your C++ implementation doesn't offer std::hash_map (not really part of the C++ standard so far!-), use Boost or grab a version off the web. One pass over the collection (i.e., O(N)) lets you build the value->count mapping; one more pass over the hash table (<= O(N), clearly) identifies values with a count > 1 and emits them appropriately. Overall O(N), which is not the case for your suggestion.
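A sketch of the two-pass idea, using C++11's std::unordered_map in place of the pre-standard hash_map mentioned above (int stands in for the element type):

#include <cstddef>
#include <unordered_map>
#include <vector>

void emit_duplicates(const std::vector<int>& input)
{
    std::unordered_map<int, std::size_t> counts;
    for (int v : input) ++counts[v];  // pass 1: value -> count, O(N)
    for (const auto& kv : counts)     // pass 2: <= O(N)
        if (kv.second > 1)
        { /* kv.first occurs kv.second times: emit the group */ }
}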

Alex Martelli
  • The second pass isn't needed. If it's already in the hash, it's a duplicate. – derobert Aug 26 '09 at 05:40
  • @derobert, yep, but you don't know **how many times** it's going to be duplicated, and the specs in the question require that. So a second pass (which doesn't change the O(N) nature of the solution) is the simplest approach! – Alex Martelli Aug 26 '09 at 05:49
3

You could use a map from a representative element to a list/vector/deque of the other elements. This requires relatively few comparisons on insertion into the container and means that you can iterate through the resulting groups without performing any comparisons.

This example always inserts the first representative element into the mapped deque storage, as it makes the subsequent iteration through the group logically simple; but if this duplication proves a problem, it would be easy to perform the push_back only if (!ins_pair.second).

typedef std::map<Type, std::deque<Type> > Storage;

void Insert(Storage& s, const Type& t)
{
    std::pair<Storage::iterator, bool> ins_pair( s.insert(std::make_pair(t, std::deque<Type>())) );
    ins_pair.first->second.push_back(t);
}

Iterating through the groups is then (relatively) simple and cheap:

void Iterate(const Storage& s)
{
    for (Storage::const_iterator i = s.begin(); i != s.end(); ++i)
    {
        for (std::deque<Type>::const_iterator j = i->second.begin(); j != i->second.end(); ++j)
        {
            // do something with *j
        }
    }
}

I performed some experiments for comparison and object counts. In a test with 100000 objects in random order forming 50000 groups (i.e. an average of 2 objects per group), the above method cost the following number of comparisons and copies:

1630674 comparisons, 443290 copies

(I tried bringing the number of copies down, but only really managed it at the expense of comparisons, which seem to be the higher-cost operation in your scenario.)

Using a multimap, and retaining the previous element in the final iteration to detect the group transitions cost this:

1756208 comparisons, 100000 copies

Using a single list and popping the front element and performing a linear search for other group members cost:

1885879088 comparisons, 100000 copies

Yes, that's ~1.9b comparisons compared to ~1.6m for my best method. To get the list method to perform anywhere near an optimal number of comparisons it would have to be sorted, and this would cost a similar number of comparisons to building an inherently ordered container in the first place.

Edit

I took your posted code and ran the implied algorithm (I had to make some assumptions about the code, as there were some assumed definitions) over the same test data set I used before, and I counted:

1885879088 comparisons, 420939 copies

i.e. exactly the same number of comparisons as my dumb list algorithm, but with more copies. I think that means we are using essentially similar algorithms for this case. I can't see any evidence of an alternative sort order, but it looks like you want a list of the groups that contain more than one equivalent element. This can be achieved simply in my Iterate function by adding an if (i->second.size() > 1) clause.

I still can't see any evidence that building a sorted container such as this map of deques isn't a good (even if not optimal) strategy.

CB Bailey
  • this is almost my solution before I ran into my problem, except that I used a list instead of a deque. Here comes the question: now that you have a deque of elements pre-processed your way, they cannot be represented by any attributes but themselves (think of files that already have identical hashes), i.e. they might be equal or different but have to be compared directly (which is expensive). How can you group the equal ones using the minimum number of comparisons? – t.g. Aug 26 '09 at 08:54
  • Each `deque` only contains equivalent elements, are you saying that you are using a different concept for `==` and `<` ? – CB Bailey Aug 26 '09 at 09:12
  • I missed that, as I used a list. If I were to use a deque, I would need to do those expensive comparisons before the insertion. No good. Because the comparisons are so expensive, I've tried my best to postpone them (including in ways similar to the one you recommended). Now comes judgment day and I have to face it, so I am looking for ways to save the number of comparisons. – t.g. Aug 26 '09 at 09:32
  • Either the problem is more involved than you've suggested on the question or you've misunderstood my answer. In my answer *each* `deque` contains a group of equivalent elements. According to the original problem statement there is no need to order them. They already contain only equivalent elements. You now seem to be implying that there is a second sub-order that differentiates the elements in each group that you want the groups ordered by. If so, can you update the question to make this clearer? – CB Bailey Aug 26 '09 at 15:41
  • @Charles Bailey, thanks for the intensive discussion and for helping me clarify my question. I've just updated the question. Regarding your specific implementation: the cost of comparison happens at s.insert(std::make_pair(t, std::deque<Type>())). You actually rely on std::map's pre-sorting implementation, a binary tree as ttvd suggested on this page, which does not necessarily involve the fewest comparisons. I did use your way to filter out the apparently different elements; now I am inside the deque to find the real ones. Your std::list test is not like mine, as you don't shrink the list while comparing. – t.g. Aug 27 '09 at 03:17
  • Yes, I understand where I am taking the comparison hit. I am taking it at the map insertion because I believe that this makes the overall cost cheapest. My answer assumed a single sort order (with `==` and `<` derived from this order). You need to be a bit clearer about what the two different sort orders are and how they contribute to your required output. – CB Bailey Aug 27 '09 at 09:22
  • @Charles Bailey, Thanks for the discussion, I did use your method, but not investigate its performance in details until the discussion with you. – t.g. Aug 31 '09 at 09:00
1

Have you tried sorting? For example, using an algorithm like quicksort? If the performance is good enough for you, then that would be an easy approach.

hhafez
1

The simplest is probably to just sort the list and then iterate over it looking for duplicates.

If you know something about the data, more efficient algorithms are possible.

For example, if you knew the list was large, and only contained integers from 1..n where n is fairly small, you could use a pair of boolean arrays (or a bitmap), and do something like this:

std::vector<bool> once(MAXIMUM_VALUE, false);
std::vector<bool> many(MAXIMUM_VALUE, false);
for (size_t i = 0; i < values.size(); i++)
    if (once[values[i]])
        many[values[i]] = true;
    else
        once[values[i]] = true;

Now many[] flags which values were seen more than once.

lavinio
1

If it is known to be a list of integers, and if it is known that they are all between A and B (e.g. A=0, B=9), make an array of B-A+1 elements, and create B-A+1 containers.

In the very specific case (a list of plain integers), I propose that you merely count them, since you cannot distinguish equal integers anyway:

// assumes values in [A, B]; counter has B-A+1 zero-initialized slots
for(int i = 0; i < countOfNumbers; i++)
    counter[number[i] - A]++;

If they are distinguishable, create an array of lists, and add them to the respective list.

If they are not numbers, use a std::map or std::hash_map, mapping keys to lists of values.

Martin v. Löwis
1

Since C++11, hash tables are provided by the standard library as std::unordered_map. So an O(N) solution is to put your values into an std::unordered_map<T, std::vector<T>>.
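A sketch of that grouping, assuming std::hash<T> and T::operator== are available (int stands in for T):

#include <unordered_map>
#include <vector>

void group_duplicates(const std::vector<int>& input)
{
    std::unordered_map<int, std::vector<int>> groups;
    for (int v : input)
        groups[v].push_back(v);          // amortized O(1) per element
    for (const auto& g : groups)
        if (g.second.size() > 1)
        { /* g.second is one complete group of duplicates */ }
}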

Ismael EL ATIFI
0

Most people mentioning hash/unordered-map solutions are assuming O(1) insertion and query time, but it could be O(N) in the worst case. Additionally, they ignore the cost of hashing the objects themselves.

Personally, I would insert the objects into a binary tree (O(log n) insertion for each) and keep a counter at each node. This yields O(n log n) construction time and an O(n) traversal to identify all duplicates.
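A sketch of that idea with std::map as the balanced binary tree and the count as the mapped value (my illustration, not the answer's exact structure):

#include <map>
#include <vector>

void count_duplicates(const std::vector<int>& input)
{
    std::map<int, int> tree;        // balanced tree: O(log n) insert
    for (int v : input) ++tree[v];  // O(n log n) construction
    for (const auto& node : tree)   // O(n) in-order traversal
        if (node.second > 1)
        { /* node.first occurs node.second times */ }
}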

ttvd
  • You are right that the other answers ignore the cost of constructing the data structure and only measure the benefit of the algorithm attached to the specific data structure. Is your method just the implementation of std::map? It seems you agree with the others that choosing a data structure is more important than the traversal algorithm. – t.g. Aug 26 '09 at 06:27
  • Indeed. I am sorry I cannot leave comments on your original question. Is it feasible to create an empty collection (list) for each type of object in your list (one for all E2's, one for all E3's, etc.)? If so, you could linearly traverse your object set and, depending on the object, find its corresponding "storage list" and insert it there. This would avoid comparisons, and depending on your implementation it could be done in O(N). – ttvd Aug 26 '09 at 08:24
  • I guess you are suggesting something similar to Charles Bailey. I indeed did that as a pre-step to group almost-equal elements, and then my problem arises: I cannot delegate the comparison to any other type but the elements themselves, and of course those comparisons are expensive. So I am looking for some way to save the number of comparisons, since there is no way to do away with them. – t.g. Aug 26 '09 at 09:07
0

If I've understood the question correctly, then this is the simplest solution I can think of:

std::vector<T> input;
typedef std::vector<T>::iterator iter;

std::vector<std::pair<iter, iter> > output;

sort(input.begin(), input.end()); // O(n lg n) comparisons

for (iter cur = input.begin(); cur != input.end();){
  std::pair<iter, iter> range = equal_range(cur, input.end(), *cur); // find all elements equal to the current one O(lg n) comparisons
  cur = range.second;
  output.push_back(range);
}

Total running time: O(n lg n). You have one sorting pass, O(n lg n), and then a second pass where O(lg n) comparisons are performed for each group (so this is done at most n times), also yielding O(n lg n).

Note that this depends on the input being a vector. Only random access iterators have logarithmic complexity in the second pass. Bidirectional iterators would be linear.

This doesn't rely on hashing (as requested), and it preserves all the original elements (rather than just returning one element for each group, along with a count of how often it occurred).

Of course, a number of smaller constant optimizations are possible. output.reserve(input.size()) on the output vector would be a good idea, to avoid reallocation. input.end() is taken much more often than necessary too, and could be easily cached.

Depending on how big the groups are assumed to be, equal_range may not be the most efficient choice. I assume it does a binary search to get logarithmic complexity, but if each group is only a couple of elements, a simple linear scan would be faster. In any case, the initial sorting dominates the cost.
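For completeness, the linear-scan variant of that second pass might look like this, reusing the declarations from the snippet above:

for (iter cur = input.begin(); cur != input.end();){
  iter last = cur;
  while (last != input.end() && *last == *cur) ++last; // extend the run with plain ==
  output.push_back(std::make_pair(cur, last));         // one group: [cur, last)
  cur = last;
}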

jalf
  • 1
    After reading the STL document, I am thinking that no pre-sort container is good for situation, because they all seem to rely on 2 less-than comparison operations to determine equality, while one (==) operation is OK in my case and comparison is the only thing I want to save. – t.g. Aug 27 '09 at 10:37
  • What pre-sort container? I'm using a vector, and sorting it manually. – jalf Aug 27 '09 at 10:47
  • it seems sort() is in the category that relies on less-than operations, hence my comment. Sorry for not having been clear. – t.g. Aug 27 '09 at 11:01
  • It relies on less than, yes, but it doesn't use that operation to emulate equality tests. It relies on less than *because it needs to perform less than operations*. You need a less than operation to sort elements. Once again, it is *not* used twice to emulate equality. In fact, I'd like you to show me a sorting algorithm, *any* sorting algorithm which relies on ==. ;) – jalf Aug 27 '09 at 11:38
  • it's quite resonable to rely on less-than. And that's the very reason sort is overkill in my case :-) – t.g. Aug 27 '09 at 11:51
  • Overkill? It is fewer comparisons than you need to work on an unsorted range. How would you find duplicate elements without sorting? – jalf Aug 27 '09 at 13:31
  • if you only want to tell whether two elements (a, b) are equal or not, providing an operator==() is better than an operator<(), because you can test if(a==b) rather than if(!(a<b) && !(b<a)) – t.g. Aug 27 '09 at 14:25
  • "How would you find duplicate elements without sorting?" -- I use the semantic of equal instead of less-than or greater-than. – t.g. Aug 27 '09 at 14:27
  • Perhaps you should read up on sorting algorithms. The point is that if you sort the list, you never *need* to test for equality. The items are grouped automatically, using just less-than semantics. If you try to use == to find duplicates, you have to compare *every* pair of objects. That's O(N^2), in itself far more than the sorting. Try running both on paper. – jalf Aug 27 '09 at 18:26
  • both of us share the understanding that "you never need to test for equality" while sorting, and in reality it's impossible to sort with equality semantics alone. My situation is to find equality, so using a method that has to mimic equality with its atomic operation (less-than, in the case of sort()) is at least not optimal. Sorting is just not optimal for finding equality; it's not that sorting doesn't need less-than. – t.g. Aug 28 '09 at 03:28
  • Sort allows you to find duplicates in far fewer comparisons than a simple equality test. Finding duplicates in an unsorted range can only be done in one way: By comparing every element with every other element. That's (N^2)/2 equality tests. However, you can perform a sort in N lg(N) less-than comparisons, and once the range is sorted, you can determine duplicates in a simple linear scan, that is, exactly N equality comparisons. Unless your N is tiny, N lg(N)+N is going to add up to fewer comparisons than (N^2)/2. Hence sorting isn't such a bad idea. – jalf Aug 28 '09 at 10:26
  • Your analysis is right in most cases. But in this case, you can erase an element from the container once it's identified as a duplicate, so as you go on the list keeps getting smaller, hence fewer comparisons each round. In the best case, when all elements are equal, you need only N-1 comparisons; but in the worst case, you've said it all. Thanks for discussing this far; I did learn more about sorting. – t.g. Aug 28 '09 at 15:40
0

Just to register that I had the same problem during the normalization of a triple store I am working with. I implemented method 3, summarized by Charles Bailey, in Common Lisp using the hash-table feature of Allegro Common Lisp.

The function "agent-equal?" is used to test whether two agents in the TS are the same. The function "merge-nodes" merges the nodes in each cluster. In the code below, "..." marks where the less important parts were removed.

(defun agent-equal? (a b)
 (let ((aprops (car (get-triples-list :s a :p !foaf:name)))
       (bprops (car (get-triples-list :s b :p !foaf:name))))
   (upi= (object aprops) (object bprops))))

(defun process-rdf (out-path filename)
 (let* (...
        (table (make-hash-table :test 'agent-equal?)))
  (progn 
   ...
   (let ((agents (mapcar #'subject 
          (get-triples-list :o !foaf:Agent :limit nil))))
     (progn 
       (dolist (a agents)
          (if (gethash a table)
            (push a (gethash a table))
            (setf (gethash a table) (list a))))
       (maphash #'merge-nodes table)
           ...
           )))))
Alexandre Rademaker