How do I remove duplicates from an unsorted container (mainly a vector) when I cannot define `operator<`, e.g. when I can only define a fuzzy compare function?

This answer using sort does not work since I cannot define a function for ordering the data.

template <typename T>
void removeDuplicatesComparable(T& cont)
{
    // For each element, erase all equal elements that follow it.
    for (auto iter = cont.begin(); iter != cont.end(); ++iter)
    {
        cont.erase(std::remove(boost::next(iter), cont.end(), *iter), cont.end());
    }
}

This is O(n²), but it should be fairly cache-friendly since the accesses are localized. Is there a faster or at least neater solution?

Edit: On why I cannot use sets: I do geometric comparisons. An example could be the following, but I have other entities besides polygons as well.

bool match(SegPoly const& left, SegPoly const& right, double epsilon)
{
    double const cLengthCompare = 0.1; // just an example
    if (!isZero(left.getLength() - right.getLength(), cLengthCompare))
        return false;
    double const interArea = areaOfPolygon(left.intersected(right)); // a geometric intersection
    return isZero(interArea - right.getArea(), epsilon);
}

So for such comparisons I would not know how to formulate sorting or a neat hash function.

Martin

  • Do not put duplicates into the container in the first place – Ed Heal Jan 18 '14 at 23:49
  • @EdHeal: Believe me, I wouldn't if I had the choice. But there are many cases where you don't (reading files containing almost-duplicate elements, in my case arcs, for example) – Martin Jan 18 '14 at 23:51
  • You can do it in one pass with an additional collection (preferably a hash map). Simply try adding each element to the map: if it is already there, the element is a duplicate and should be removed; otherwise keep it and add it to the map. – SOReader Jan 18 '14 at 23:51
  • Assuming the collection is not so large that you can't hold two copies, create a second set that enforces uniqueness and delete items based on whether you could insert them or not. – Joe Jan 18 '14 at 23:52
  • Do you have to preserve order of the elements which are kept? – Ben Voigt Jan 18 '14 at 23:52
  • @BenVoigt No, order is not an issue, but as I said I cannot "sort" anyhow. Hashes might be an option, as indicated in the comments ... – Martin Jan 18 '14 at 23:53
  • The [tag:stl] tag doesn't match the `boost::next` you're referring to, BTW. (Not sure if it helps to add a [tag:boost] tag.) – πάντα ῥεῖ Jan 18 '14 at 23:55
  • If you can compute a hash, you can almost certainly produce *some sort* of order. (For example, you could start by ordering the hash values :) -- or more generally, sort by the bitstring which is hashed.) The fact that the order relationship has no meaning for the datatype does not invalidate its use as an order relationship. – rici Jan 18 '14 at 23:55
  • Might as well just add everything to a new `set`. Removing items from containers re-arranges memory and does a lot of copying – SwiftMango Jan 18 '14 at 23:55
  • @texasbruce: if we accept the questioner's claim that it is not possible to order the elements, then it is not possible to add them to a `set`. – Steve Jessop Jan 18 '14 at 23:59
  • @Rici This is a good idea: the hash could at least limit the search space, e.g. hashing a length or something similar – Martin Jan 19 '14 at 00:07
  • "So for such comparisons I would not know how to formulate sorting or a neat hash function." - That is your issue. You need to formulate either a sorting function, or a hash function. It is not that you *cannot* use `std::set` or `std::unordered_set`, it is that you do not know how to accomplish it. – Zac Howland Jan 19 '14 at 00:12
  • @Zac: That's just not correct. He can certainly come up with mappings that help find neighbors, but there may not exist any mapping that has the same equivalence classes as his distance-based comparison, as a pure hashtable function would require. – Ben Voigt Jan 19 '14 at 00:14
  • @BenVoigt In order to remove "duplicates", he would have to be able to test for equality. If he can test for equality, he can create a hash that would be the same for 2 "equal" items. – Zac Howland Jan 19 '14 at 00:16
  • @ZacHowland I do not see any option to do fuzzy comparisons only using the ordering of a set or hashmap. Nonetheless it could limit the search space as BenVoigt indicated in his answer – Martin Jan 19 '14 at 00:16
  • @Zac: Please propose a hash for `bool operator==(double a, double b) { return abs(a-b) < .001; }` And now for `bool operator==(Point2D a, Point2D b) { return (a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) < .000001; }` Notice that these are reflexive and commutative but not transitive. – Ben Voigt Jan 19 '14 at 00:19
  • @Martin A hash will return the same value for the same input. So if your "fuzzy" comparison (e.g. limiting the precision of a `double`) has 2 values that are "similar" (but you want to treat the same), your hash function would take the input at the desired precision level - making the comparison function treat them the same. – Zac Howland Jan 19 '14 at 00:20
  • @Zac: But there are values which are close neighbors but lie either side of a decision boundary. I.e. `.0999999` and `.1000001`, where the "hash" is truncation to 4 decimals. Now we're looking for a data structure that does not find collisions, but narrows the candidates for collision, i.e. oct-tree territory, and hence my answer. – Ben Voigt Jan 19 '14 at 00:21
  • @BenVoigt You do not have to use truncation, and your `Point2D` version does not calculate distance, so of course it is going to be all over the place. And a hash function isn't going to return `bool` ... – Zac Howland Jan 19 '14 at 00:24
  • @Zac: Of course a hash will not return `bool`. The equality function does though. If truncation isn't best, suggest something better. The `Point2D` version applies a bound on the 2-norm (it actually computes only the square of the norm, but that's a convex transformation and does not change the comparison result) – Ben Voigt Jan 19 '14 at 00:34
  • @BenVoigt So you ask me to provide a hash function at the same time you are discussing equality functions? – Zac Howland Jan 19 '14 at 00:42
  • @Zac: Your words: "If he can test for equality, he can create a hash that would be the same for 2 "equal" items." I claim it isn't so simple, and provided a pair of "equality functions" for which no such hashes exist. Lack of transitivity forbids the existence of a hash. – Ben Voigt Jan 19 '14 at 00:44
  • @BenVoigt Ah, I see where you were going. However, you can make a simple hash function for `Point2D` that contains 2 doubles simply by multiplying each double by the inverse of the desired precision + 1, convert the result to an integer, divide by 10, check the remainder (if >= 5, round up). Store the `x` portion in the upper 32-bits of an `unsigned long long` and `y` portion in the lower 32-bits. Then use `std::hash`. – Zac Howland Jan 19 '14 at 01:04
  • @Zac: But you still end up with "equal" values with different hashes. – Ben Voigt Jan 19 '14 at 01:22
  • @BenVoigt Only if you use different values for your desired precision and the precision you use for equality; if you define them to be the same, you're effectively converting a continuous Cartesian plane into a discrete Cartesian plane. – Zac Howland Jan 19 '14 at 01:26
  • @Zac: It's the difference between `round(a,.001) - round(b,.001)` and `round(a-b,.001)` – Ben Voigt Jan 19 '14 at 01:27
  • @BenVoigt That is avoided by the `precision + 1` note ;) – Zac Howland Jan 19 '14 at 01:30
  • @Zac: No it's not. Let `a = 1.0004, b = 1.0010, c = 1.0018`, with the comparison function above (`return abs(x-y) < .001`). What would be `hash(a), hash(b), hash(c)`. Note that `a == b` and `b == c` but `!(a == c)`. So the hash *cannot* exist. – Ben Voigt Jan 19 '14 at 01:37

2 Answers

First, don't remove elements one at a time.

Next, use a hash table (or similar structure) to detect duplicates.

If you don't need to preserve order, then copy all elements into a hashset (this destroys duplicates), then recreate the vector using the values left in the hashset.

If you need to preserve order, then (a sketch follows the list below):

  1. Set read and write iterators to the beginning of the vector.
  2. Start moving the read iterator through, checking elements against a hashset or octtree or something that allows finding nearby elements quickly.
  3. For each element that collides with one in the hashset/octtree, advance the read iterator only.
  4. For elements that do not collide, move from read iterator to write iterator, copy to hashset/octtree, then advance both.
  5. When read iterator reaches the end, call erase to truncate the vector at the write iterator position.
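
A minimal sketch of these steps, assuming for illustration that `Type` has a usable `std::hash` specialization and `operator==` (for the fuzzy case, substitute an octtree or grid query for the `unordered_set` lookup):

#include <unordered_set>
#include <vector>

template <typename Type>
void removeDuplicatesStable(std::vector<Type>& cont)
{
    std::unordered_set<Type> seen;      // stand-in for the octtree in the fuzzy case
    auto write = cont.begin();
    for (auto read = cont.begin(); read != cont.end(); ++read)
    {
        if (seen.insert(*read).second)  // true -> first occurrence
        {
            if (write != read)
                *write = std::move(*read); // compact survivors toward the front
            ++write;
        }
        // otherwise a duplicate: advance the read iterator only
    }
    cont.erase(write, cont.end());      // truncate at the write position
}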

The key advantage of the octtree is that while it doesn't let you immediately determine whether there is something close enough to be a "duplicate", it allows you to test against only near neighbors, excluding most of your dataset. So your algorithm might be O(N lg N) or even O(N lg lg N) depending on the spatial distribution.
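
For intuition, here is a hypothetical coarse-grid variant of the same idea, sketched for 2D points (the names `almostEqual` and `dedupeByGrid` are illustrative, not from the question): quantize each element to a grid cell at least as large as the tolerance, then compare only against the contents of that cell and its 8 neighbours.

#include <cmath>
#include <map>
#include <utility>
#include <vector>

struct Point { double x, y; };

// Illustrative fuzzy equality, standing in for the real geometric match().
bool almostEqual(Point const& a, Point const& b, double eps)
{
    return std::abs(a.x - b.x) < eps && std::abs(a.y - b.y) < eps;
}

std::vector<Point> dedupeByGrid(std::vector<Point> const& in, double eps)
{
    // Cell size == eps, so any fuzzy duplicate of p lies in p's own cell
    // or in one of the 8 adjacent cells.
    auto cellOf = [eps](double v) { return (long long)std::floor(v / eps); };
    std::map<std::pair<long long, long long>, std::vector<Point>> grid;
    std::vector<Point> survivors;
    for (Point const& p : in)
    {
        long long const cx = cellOf(p.x), cy = cellOf(p.y);
        bool dup = false;
        for (long long dx = -1; dx <= 1 && !dup; ++dx)
            for (long long dy = -1; dy <= 1 && !dup; ++dy)
            {
                auto it = grid.find({cx + dx, cy + dy});
                if (it == grid.end()) continue;
                for (Point const& q : it->second)
                    if (almostEqual(p, q, eps)) { dup = true; break; }
            }
        if (!dup)
        {
            grid[{cx, cy}].push_back(p); // only survivors enter the index
            survivors.push_back(p);
        }
    }
    return survivors;
}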

Again, if you don't care about the ordering, you can actually move survivors into the hashset/octtree and at the end move them back into the vector (compactly).

Ben Voigt
  • *"First, don't remove elements one at a time."* Are you referring to `vector::erase`? – dyp Jan 18 '14 at 23:54
  • What do you mean by "one at a time"? Doing the erase only at the end? – Martin Jan 18 '14 at 23:54
  • @dyp: I'm referring to the fact that if element #2 is removed and element #3 is also removed, element #4 gets moved twice. It should be moved once, directly to index #2. – Ben Voigt Jan 18 '14 at 23:56
  • @Martin: It is a bit more than that, but that would help. – Ben Voigt Jan 18 '14 at 23:56
  • A hashset could be good, but I have to calculate a hash. In my case I compare geometrical entities for almost-equality. How would you formulate "almost equal" as a hash function (e.g. almost identical polygons)? – Martin Jan 18 '14 at 23:57
  • @Martin: I'd use an octtree to store elements, that way I can find ones with nearby vertices efficiently, and need to compare only a tiny number for collision. – Ben Voigt Jan 19 '14 at 00:00
  • @Steve: He did call it a "fuzzy compare", that's why I started putting in octtree references. – Ben Voigt Jan 19 '14 at 00:02
  • @Martin: fuzzy equal does not allow a deterministic answer to your desire to remove "duplicates". – rici Jan 19 '14 at 00:03
  • @rici: The result may not be deterministic (or may not be input-order-independent). For example, if you're removing floating-point numbers which are "duplicate" to within 1/1000, then given the input `{ 1.0004, 1.0010, 1.0018 }` then both `{ 1.0010 }` and `{ 1.0004, 1.0018 }` are legitimate "correct" results of de-duplication. – Ben Voigt Jan 19 '14 at 00:06
  • @SteveJessop: True, I added an example for clarification – Martin Jan 19 '14 at 00:09
  • In effect the question is asking for a cover of the input, where each element of the cover is the neighbourhood of a point in the input, defined by the "nearby" / "almost equal" / "fuzzy compare" function, and no central point of a neighbourhood is a member of any other neighbourhood in the result. So sure, of course there may be multiple such covers. Finding the one with the least elements (if required) is (I think) complexity-harder than just finding one. – Steve Jessop Jan 19 '14 at 00:10
  • @Steve: In addition, it's debatable whether points in the cover should be exactly points in the input, or some mean of the set of points they covered (which might allow an even lower cardinality solution). The goal might be the set with minimum cardinality and total lowest norm-ball error over all inputs. Finding "one" solution is indeed much easier than finding the optimal. – Ben Voigt Jan 19 '14 at 00:12
  • @BenVoigt: Precisely; that's one example. – rici Jan 19 '14 at 00:24
  • @SteveJessop: alternatively, the question may be to find a subset whose minimum pair-wise distance is greater than some threshold. Assuming the distance function obeys the triangle inequality, that's not too difficult, I believe. If the subset must be minimal or maximal in size, the difficulty increases. – rici Jan 19 '14 at 00:26
  • @rici: mmm, you can play the game differently according to whether you assume that the "almost-equal" function defines a metric space or merely an arbitrary topology (or something that isn't even a topology) :-) – Steve Jessop Jan 19 '14 at 00:29
  • @SteveJessop: Hence my reference to the triangle inequality. – rici Jan 19 '14 at 00:30
  • And this is why "Computer Science" is considered a branch of theoretical mathematics... – Ben Voigt Jan 19 '14 at 00:30

If you don't want to rewrite your code to prevent duplicates from being placed in the vector to begin with, you can do something like this:

#include <unordered_set>
#include <vector>

std::vector<Type> myVector;
// fill in the vector's data
// note: requires a std::hash<Type> specialization and operator== for Type
std::unordered_set<Type> mySet(myVector.begin(), myVector.end());
myVector.assign(mySet.begin(), mySet.end());

This is O(2n) = O(n) on average, given a reasonable hash.

std::set (or std::unordered_set, which uses a hash instead of a comparison) doesn't allow duplicates, so it eliminates them as the set is initialized. Then you re-assign the vector with the de-duplicated data.

Since you are insisting that you cannot create a hash, another alternative is to create a temporary vector:

#include <algorithm>
#include <vector>

std::vector<Type> vec1;
// fill vec1 with your data
std::vector<Type> vec2;
vec2.reserve(vec1.size()); // vec1.size() is the maximum possible size of vec2
std::for_each(vec1.begin(), vec1.end(), [&](const Type& t)
{
    bool is_unique = true;
    for (const Type& u : vec2)
    {
        if (YourCustomEqualityFunction(u, t)) // a match means t is a duplicate
        {
            is_unique = false;
            break;
        }
    }

    if (is_unique)
    {
        vec2.push_back(t);
    }
});

vec1.swap(vec2);

If copies are a concern, switch to a vector of pointers, and you can decrease the memory reallocations:

#include <algorithm>
#include <memory>
#include <vector>

std::vector<std::shared_ptr<Type>> vec1;
// fill vec1 with your data
std::vector<std::shared_ptr<Type>> vec2;
vec2.reserve(vec1.size()); // vec1.size() is the maximum possible size of vec2
std::for_each(vec1.begin(), vec1.end(), [&](const std::shared_ptr<Type>& t)
{
    bool is_unique = true;
    for (const std::shared_ptr<Type>& u : vec2)
    {
        if (YourCustomEqualityFunction(*u, *t)) // compare the pointed-to objects
        {
            is_unique = false;
            break;
        }
    }

    if (is_unique)
    {
        vec2.push_back(t); // copies only the pointer, not the object
    }
});

vec1.swap(vec2);
Zac Howland
  • As I said, a set is NOT an option. I cannot formulate a sensible less-than condition on the data at hand – Martin Jan 18 '14 at 23:58
  • @Martin You can change the comparison operation when you create the set. It is the second template parameter. Or use `std::unordered_set`. – Zac Howland Jan 18 '14 at 23:59
  • @Zac: The comparison operator still needs to have the behavior of a total ordering. How would you do that for multi-dimensional data? – Ben Voigt Jan 19 '14 at 00:01
  • @BenVoigt Did I miss something? Where is he talking about multi-dimensional data? – Zac Howland Jan 19 '14 at 00:03
  • @BenVoigt: the axiom of choice implies that every set can be well-ordered. So an order exists; the difficulty is finding it ;-) – Steve Jessop Jan 19 '14 at 00:03
  • @Steve: It doesn't mean that an ordering exists which is useful for this removal process. – Ben Voigt Jan 19 '14 at 00:04
  • @ZacHowland _'I'm failing to understand the 3 down-votes.'_ Group psychology effects, maybe ;-) : _"In order to be an immaculate member of a flock of sheep, one must above all be a sheep oneself." A.E._ – πάντα ῥεῖ Jan 19 '14 at 00:11
  • @BenVoigt: I know, that's one of the reasons it's just a joke. Another is that the proof of the well-order principle from the axiom of choice provides no algorithmic construction of an order. The last is that any type that can be represented in real hardware has only finitely many members anyway. What you would need to use a total order to perform this operation is actually a space-filling curve: if you don't already know one due to the topology of the type in question, then inventing one can be a pretty tall order – Steve Jessop Jan 19 '14 at 00:13
  • @ZacHowland: downvote in my case is because you started with an answer that ignored the major constraint in the question that makes the task non-trivial. When this was pointed out you changed to an answer that's redundant given Ben's answer, so I haven't removed the downvote. – Steve Jessop Jan 19 '14 at 00:19
  • @SteveJessop When I answered, that constraint was not in the question, and Ben's answer did not contain a code example. – Zac Howland Jan 19 '14 at 00:30
  • @Zac: The earliest recorded version of the question, well before your answer, said "I do not have the possibility to define operator< e.g. when I can only define a fuzzy compare function. This answer using sort does not work since I cannot define a function for ordering the data." – Ben Voigt Jan 19 '14 at 00:32
  • @BenVoigt And I pointed out that you do not have to use `less` - you can define your own comparison function. – Zac Howland Jan 19 '14 at 00:40
  • @Zac: But `std::sort` and `std::set` won't work right with a "fuzzy compare" function, it has to be transitive, anti-symmetric, etc. – Ben Voigt Jan 19 '14 at 00:41
  • On your latest edit: you use the copy constructor (which might be quite expensive) and you use assign instead of swapping. How will this be any more effective than just removing elements and thus freeing memory? – Martin Jan 19 '14 at 00:43
  • @BenVoigt Hence the mention of `std::unordered_set`, which just requires the computation of a hash. And even the edit which provides for usage of just his "fuzzy comparison" function. – Zac Howland Jan 19 '14 at 00:44
  • @Martin If copy-construction is your worry, you should be storing them as a vector of pointers, not a vector of objects. Every time the vector has to resize, you are calling copy-constructors. You can use the swap-idiom here as well (in fact, it would probably be more efficient). The effectiveness will depend on how many duplicates you actually have. A modified version without the copies will be updated momentarily. – Zac Howland Jan 19 '14 at 00:47
  • @Zac: Your latest code example is better (although `vec2.push_back(t)` should be improved to `vec2.emplace_back(std::move(t))`... but this requires explicit iteration instead of `std::for_each`). Still `O(N^2)` however. – Ben Voigt Jan 19 '14 at 00:47
  • @Martin: Every time you remove an element (and there are O(N) of these), you move or copy the O(N) elements that follow it. That's expensive, and building just the vector of survivors is much cheaper. That can also be done in place in the original vector, as I mentioned in my answer. – Ben Voigt Jan 19 '14 at 00:49
  • @BenVoigt It is only `O(N^2)` in the worst case (e.g. if there are no duplicates at all). Depending on the data, the average case could be closer to `O(N log N)`. How would `std::move` work when inserting data from one contiguous block of memory into another? – Zac Howland Jan 19 '14 at 00:57
  • @Zac: Using move constructor gives the opportunity to steal owned resources instead of cloning. And yours is definitely `O(Nk)` where `k` is the number of survivors. Which is `O(N^2)` if `k` is linear in `N`, which may or may not be a reasonable assumption. While an octtree gives more like `O(N lg k)`. – Ben Voigt Jan 19 '14 at 00:59
  • @BenVoigt Again, only in the *worst* case. If the duplicate is earlier in the list of survivors, it breaks out (which decreases the complexity of the best and average cases). If the entire vector contains nothing but duplicates (the best case), the complexity is O(N * 1) (which is the same as using the tree). – Zac Howland Jan 19 '14 at 01:14