How to remove duplicates in a vector based on an equality predicate?

Question

I have a struct which roughly looks like that:

struct vec3
{
    int x;
    int y;
    int z;

    constexpr bool operator==(const vec3& v) const
    {
        return (x == v.x) && (y == v.y) && (z == v.z);
    }

    constexpr vec3 operator-() const
    {
        return {-x, -y, -z};
    }
};

I then generate a std::vector of vec3 with random values for each coordinates. The function in which it is used requires that no couple of values {v1, v2} in that vector verifies v1 == -v2. I obviously need that definition of operator== (the one in the snippet above) elsewhere in code, otherwise that problem would be trivial.

I first attempted std::set and std::sort + std::unique, but could not find any way to have a struct filing named requirements Compare for that application (which is needed for both approaches).

How can I proceed?

Note:

This is somewhat different from Removing duplicates from a non-sortable vector in which pointers are sorted, and also from C++ how to remove duplicates from vector of Class type? in which the elements are sortable according to some criteria (I think).

Erel · Accepted Answer · 2022-04-20T18:15:47.330

I believe the simplest method would be to use std::unordered_set and exploit its second and third template parameters.

Method

Define a hash function

This step goal is a provide a " pre-filtering " step which eliminates obvious duplicates according to the meaning in the context (so here, v1 and -v1 should for example have the same hash).

That's something that should be on a per-class basis . There is no way to come up with a generic performant hashing function, though non-performant critical application may use the first hasher below (but I won't really recommend that anymore).

a. The bad hasher

This is what I originally proposed, before taking comment from @axnsan and @François Andrieux in count.

The simplest hasher I can think of looks like that:

struct bad_hasher
{
    std::size_t operator()(const value_type&) const
    {
        return 0;
    }
};

It makes all hash collide and forces std::unordered_set to use KeyEqual to determine objects equality. So indeed, that works, but that does not make it a good choice. @axnsan and @François Andrieux pointed the following drawbacks in the comment below:

"it changes its performance characteristics to O(n^2) (it will have to iterate through all elements on each lookup) "(- @axnsan)
"[it converts] the set into a simple unsorted linked list. Every element will collide with every other element, it it looks like typical implementations use collision chaining". (- @François Andrieux)

In other words, this makes std::unordered_set become the same as a std::vector + std::remove_if.

b. The better hasher

@axnsan suggests the following hasher for this particular application:

struct better_hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

It fills the following requirements:

better_hasher(v) == better_hasher(-v).
v1 != v2 => better_hasher(v1) != better_hasher(v2) in most cases ((1, 0, 1) will collide with (1, 1, 0) for example)
not all hashes collide.
removes obvious duplicates.

Which is probably somewhere near the best we could hope for in this configuration.

We then need to remove those "false positives" due to hash collisions.

Define a key equality predicate

The goal here is to remove duplicates that were not detected by the hasher (here, typically vectors such as (1, 0, 1) / (1, 1, 0) or overflow).

Declare a predicate struct which roughly looks like:

struct key_equal
{
    bool operator()(const value_type& a, const value_type& b) const
    {
        
        return (a == b) || <...>;
    }
};

Where <...> is anything making two values identical in the current context ( so here a == b) || -a == b for example).

Note that this expects operator== to be implemented.

Erase duplicates

Declare an std::unordered_set which removes duplicates:

const std::unordered_set<value_type, hasher, key_equal> s(container.begin(), container.end());
container.assign(s.begin(), s.end());

(alt) Erase duplicates (and conserve original order in container)

Basically the same, but this checks if an element can be inserted in the std::unordered_set, and if does not, remove it. Adapted from @yury's answer in What's the most efficient way to erase duplicates and sort a vector?.

std::unordered_set<value_type, hasher, comparator> s;

const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};

container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());

Generic (container-independent) templated function:

template<typename key_equal_t, typename hasher_t, bool order_conservative, typename container_t>
void remove_duplicates(container_t& container)
{
    using value_type = typename container_t::value_type;

    if constexpr(order_conservative)
    {
        std::unordered_set<value_type, hasher_t, key_equal_t> s;
        const auto predicate = [&s](const value_type& value){return !s.insert(value).second;};
        container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
    }
    else
    {
        const std::unordered_set<value_type, hasher, key_equal_t> s(container.begin(), container.end());
        container.assign(s.begin(), s.end());
    }
}

Expects to be provided key_equal_t and hasher_t (and a bool known a compile time indicating if you care about element being kept in the same order or not). I did not benchmark any of the two branches in this function so I do not know if one is significantly better than another, though this answer seems show manual insertion may be faster.

Example in this use case:

/// "Predicate" indicating if two values are considered duplicates or not
struct key_equal
{
    bool operator()(const vec3& a, const vec3& b) const
    {
        // Remove identical vectors and their opposite
        return (a == b) || (-a == b);
    }
};

/// Hashes a vec3 by adding absolute values of its components.
struct hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

remove_duplicates<key_equal, hasher, order_conservative>(vec);

Test data:

vec3 v1{1, 1, 0};   // Keep
vec3 v2{0, 1, 0};   // Keep
vec3 v3{1, -1, 0};  // Keep
vec3 v4{-1, -1, 0}; // Remove
vec3 v5{1, 1, 0};   // Remove

std::vector vec{v1, v2, v3, v4, v5};

Test 1: non-order-conservative:

// ...<print vec values>
remove_duplicates<key_equal, hasher, false>(vec);
// ... <print vec values>

Output (before / after):

(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, -1, 0) (0, 1, 0) (1, 1, 0)

Test 2: order-conservative:

// ... <print vec values>
remove_duplicates<key_equal, hasher, true>(vec);
// ... <print vec values>

Output (before / after):

(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0) 
(1, 1, 0) (0, 1, 0) (1, -1, 0)

Defining the hash function to return the same value for all objects completely defeats the purpose of using an unordered_set as it changes its performance characteristics to O(n^2) (it will have to iterate through all elements on each lookup) — axnsan, Apr 20 '22 at 14:56
A better hash function would combine the absolute values of all vector components - this still guarantees that "equal" vectors (as defined in the original question) get equal hashes while minimizing the pairs of "unequal" vectors that get the same hash. — axnsan, Apr 20 '22 at 15:07
That's indeed a much better approach (as suggested by @eerorika above) for this particular use case, assuming overflow is properly handled (that's a concern in my case I did not think was relevant when writing my question). But I don't think it would possible for something generic to have a performant approach like that, would it? At least not with non-user defined hash. By generic, I mean a structure which is harder to hash cleverly for such a problem. I guess I tried to be too generic in this answer anyway, effectively sacrificing performance. — Erel, Apr 20 '22 at 15:22
You are indeed right, it would be impossible to generically derive a suitable hash for any user-supplied equality predicate. It's the user's responsibility to provide a hash function that respects the equality relation (that is: a == b must imply hash(a) == hash(b)) while also minimizing collisions (i.e. hash(a) == hash(b) does not need to guarantee a == b, but the hash function should strive to minimise its miss rate). While "return 0" is a technically valid hash function it completely removes any gains from using a hash table. — axnsan, Apr 20 '22 at 15:29
Also of note is that eerorika's answer is wrong for the question as stated, since it will treat any combination of positive/negative elements as equal, e.g. it would treat {a, b, c} as equal to {a, -b, c}, while the question asks for just {a, b, c} == {-a, -b, -c}. The hash table based solution with my proposed hash function doesn't have this problem, since even though the hashes wold be erroneously equal, the predicate function can eliminate false positives by applying the correct comparison. — axnsan, Apr 20 '22 at 15:32
I think you should gather all of that in an answer. That may help someone in the future (and do not hesitate to point where I was wrong in mine ^-^) — Erel, Apr 20 '22 at 15:34
By using an identity hash function `return 0;` you are converting the set into a simple unsorted linked list. Every element will collide with every other element, it it looks like typical implementations use [collision chaining](https://stackoverflow.com/a/21519560/7359094). Every lookup is a linear search through the entire set. You might as well just use `std::vector` and use `std::find_if` to search through it. `unordered_set` is just getting in the way here. — François Andrieux, Apr 20 '22 at 15:47
@FrançoisAndrieux Thank you for your time. This has been explained in details with better alternatives by axnsan in his previous comment. — Erel, Apr 20 '22 at 15:51
@Erel I feel those comments did not sufficiently express the downside. It focuses on why a good hash function is important but I don't think it mentions that it turns the set into a list (at least in terms of look-up performance). — François Andrieux, Apr 20 '22 at 15:53
@Erel I think this answer is perfectly fine if you just add a good hash function. Simply combining the 3 vector values with something similar to https://stackoverflow.com/questions/35985960/c-why-is-boosthash-combine-the-best-way-to-combine-hash-values should suffice — axnsan, Apr 20 '22 at 16:05
@axnsan Thank for your feedback. I will edit my answer and include what you suggested and noticed your comments. — Erel, Apr 20 '22 at 16:59
@FrançoisAndrieux You're right. I extrapolated that but it's indeed not expressed as much as you did in the previous comments. — Erel, Apr 20 '22 at 17:01

eerorika · Answer 2 · 2022-04-20T15:16:29.683

1

The function in which it is used requires that no couple of values {v1, v2} in that vector fills v1 == -v2

but could not find any way to have a struct filing named requirements Compare for that application (which is needed for both approaches).

It seems to me that you're trying to solve X, but this is the Y in your XY-problem.

It's fairly simple to implement an ordered comparator that satisfies the equality of -v == v. Simply compare absolute values:

struct vec3
{
    int x;
    int y;
    int z;

    // normal comparison that treats -x != x
    friend auto operator<=>(const vec3&, const vec3&) = default;
};

// declare in same namespace as vec3 for ADL
vec3 abs(const vec3& v) {
    return {std::abs(v.x), std::abs(v.y), std::abs(v.z)};
}


struct abs_less {
    template< class T, class U>
    constexpr auto operator()( T&& lhs, U&& rhs ) const
        -> decltype(std::forward<T>(lhs) < std::forward<U>(rhs))
    {
        using std::abs; // for integers
        return abs(lhs) < abs(rhs); // this implementation assumes normal comparison operator
        // you can implement logic directly here if that's not possible
    }
};

This comparator works with both std::set and std::sort + std::unique. Example with set:

std::set<vec3, abs_less> example;

Of course, you could overload the operator< directly and use std::less, but usually I would recommend against non-defaulted operator overloads that have unusual behaviour.

edited Apr 20 '22 at 15:16

answered Apr 20 '22 at 14:30

eerorika

232,697
12
197
326

Humm... I get a compile error in the current version (revision 3), which boils down to `/usr/include/c++/11/bits/stl_tree.h:762:23: note: ‘std::__is_invocable{}’ evaluates to false`. Test code is `const std::set s(vec.begin(), vec.end());` `vec.assign(s.begin(), s.end());`, where `vec` is generated the same as in my answer. I did not change `vec3` definition either. I compile under gcc (Ubuntu 11.2.0-7ubuntu2) 11.2.0. – Erel Apr 20 '22 at 14:54
@Erel There might be a bug in the details, I didn't test thoroughly. – eerorika Apr 20 '22 at 14:55
Okay nevermind, I'm stupid. Basically `some_namespace::abs` returns `vec3`, thus current version (revision 3) is equivalent to have `return v1 < v2;`. But because I did not define that `operator<` (I agree with you, I'd like to avoid that), well, this does not compile. I get the idea behind though, thank you. – Erel Apr 20 '22 at 15:09
@Erel Oh yes of course. This implementation assumes that there is a "normal" `operator<` for the class. – eerorika Apr 20 '22 at 15:11
There's something else @axnsan mentioned in one comment on another answer (small rewording here mine). *This approach (as of revision 4) will result in `{a, b, c} <==> {a, -b, c}`, while the question asks for `{a, b, c} <==> {-a, -b, -c}`*. – Erel Apr 20 '22 at 19:31
@Erel Oh yea, that's a good point. I think that (the desired logic) may be also implementable, but I think it's more complex. – eerorika Apr 20 '22 at 19:37