I believe the simplest method would be to use `std::unordered_set` and exploit its second and third template parameters.
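For reference, those are the `Hash` and `KeyEqual` parameters of its declaration in `<unordered_set>`:

```cpp
template<
    class Key,
    class Hash      = std::hash<Key>,
    class KeyEqual  = std::equal_to<Key>,
    class Allocator = std::allocator<Key>
> class unordered_set;
```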
Method
- Define a hash function
The goal of this step is to provide a "pre-filtering" step which eliminates obvious duplicates according to their meaning in the context (so here, for example, `v1` and `-v1` should have the same hash).
That's something that should be done on a per-class basis. There is no way to come up with a generic, performant hash function, though non-performance-critical applications may use the first hasher below (but I won't really recommend it anymore).
a. The bad hasher
This is what I originally proposed, before taking the comments from @axnsan and @François Andrieux into account.
The simplest hasher I can think of looks like this:
```cpp
struct bad_hasher
{
    std::size_t operator()(const value_type&) const
    {
        return 0;
    }
};
```
It makes all hashes collide and forces `std::unordered_set` to use `KeyEqual` to determine object equality.
So indeed, that works, but that does not make it a good choice. @axnsan and @François Andrieux pointed out the following drawbacks in the comments below:
- "it changes its performance characteristics to O(n^2) (it will have to iterate through all elements on each lookup) "(- @axnsan)
- "[it converts] the set into a simple unsorted linked list. Every element will collide with every other element, it it looks like typical implementations use collision chaining". (- @François Andrieux)
In other words, this makes `std::unordered_set` behave the same as a `std::vector` + `std::remove_if`.
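To make the comparison concrete, here is a sketch of that equivalent quadratic approach (the function name is mine, for illustration only):

```cpp
#include <algorithm>
#include <vector>

// O(n^2) duplicate removal by linear search: effectively what
// bad_hasher degrades std::unordered_set into.
template<typename T, typename Equal>
void remove_duplicates_quadratic(std::vector<T>& v, Equal equal)
{
    auto last = v.begin(); // end of the "kept" prefix
    for (auto it = v.begin(); it != v.end(); ++it)
    {
        // Scan every element kept so far; this is the O(n) inner loop.
        const bool duplicate =
            std::any_of(v.begin(), last, [&](const T& kept) { return equal(kept, *it); });
        if (!duplicate)
        {
            if (it != last)
                *last = std::move(*it);
            ++last;
        }
    }
    v.erase(last, v.end());
}
```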
b. The better hasher
@axnsan suggests the following hasher for this particular application:
```cpp
struct better_hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};
```
It fulfills the following requirements:

- `better_hasher(v) == better_hasher(-v)`.
- `v1 != v2` => `better_hasher(v1) != better_hasher(v2)` in most cases (`(1, 0, 1)` will collide with `(1, 1, 0)` for example).
- not all hashes collide.
- it removes obvious duplicates.

Which is probably somewhere near the best we could hope for in this configuration.
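As a quick illustration (assuming the minimal `vec3` type sketched further down in this answer, and `<cassert>`):

```cpp
// |1|+|0|+|1| == |1|+|1|+|0| == 2: distinct vectors, same hash (acceptable collision).
assert(better_hasher{}(vec3{1, 0, 1}) == better_hasher{}(vec3{1, 1, 0}));
// A vector and its opposite hash identically, as required.
assert(better_hasher{}(vec3{1, -1, 0}) == better_hasher{}(vec3{-1, 1, 0}));
```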
We then need to remove those "false positives" due to hash collisions.
- Define a key equality predicate
The goal here is to remove the duplicates that were not detected by the hasher (here, typically vectors such as `(1, 0, 1)` / `(1, 1, 0)`, or collisions due to overflow).
Declare a predicate struct which roughly looks like:
```cpp
struct key_equal
{
    bool operator()(const value_type& a, const value_type& b) const
    {
        return (a == b) || <...>;
    }
};
```
Where `<...>` is anything making two values identical in the current context (so here `(a == b) || (-a == b)` for example).
Note that this expects `operator==` to be implemented.
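The question's `vec3` is not shown here, so the examples below assume a minimal sketch along these lines (member names and types are assumptions):

```cpp
struct vec3
{
    int x, y, z;

    bool operator==(const vec3& o) const { return x == o.x && y == o.y && z == o.z; }
    vec3 operator-() const { return {-x, -y, -z}; } // component-wise opposite
};
```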
- Erase duplicates
Declare an `std::unordered_set` which removes duplicates on construction:
```cpp
const std::unordered_set<value_type, hasher, key_equal> s(container.begin(), container.end());
container.assign(s.begin(), s.end());
```
- (alt) Erase duplicates (and preserve the original order in the container)
Basically the same, but this checks whether an element can be inserted into the `std::unordered_set`, and if it cannot, removes it from the container. Adapted from @yury's answer in *What's the most efficient way to erase duplicates and sort a vector?*.
```cpp
std::unordered_set<value_type, hasher, key_equal> s;
const auto predicate = [&s](const value_type& value) { return !s.insert(value).second; };
container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
```
Generic (container-independent) templated function:
```cpp
template<typename key_equal_t, typename hasher_t, bool order_conservative, typename container_t>
void remove_duplicates(container_t& container)
{
    using value_type = typename container_t::value_type;
    if constexpr (order_conservative)
    {
        std::unordered_set<value_type, hasher_t, key_equal_t> s;
        const auto predicate = [&s](const value_type& value) { return !s.insert(value).second; };
        container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
    }
    else
    {
        const std::unordered_set<value_type, hasher_t, key_equal_t> s(container.begin(), container.end());
        container.assign(s.begin(), s.end());
    }
}
```
It expects to be provided `key_equal_t` and `hasher_t` (and a `bool` known at compile time indicating whether you care about elements being kept in the same order or not). I did not benchmark either branch of this function, so I do not know if one is significantly better than the other, though this answer seems to show that manual insertion may be faster.
Example in this use case:
/// "Predicate" indicating if two values are considered duplicates or not
struct key_equal
{
bool operator()(const vec3& a, const vec3& b) const
{
// Remove identical vectors and their opposite
return (a == b) || (-a == b);
}
};
/// Hashes a vec3 by adding absolute values of its components.
struct hasher
{
std::size_t operator()(const vec3& v) const
{
return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
}
};
```cpp
remove_duplicates<key_equal, hasher, order_conservative>(vec);
```
Test data:
```cpp
vec3 v1{1, 1, 0};   // Keep
vec3 v2{0, 1, 0};   // Keep
vec3 v3{1, -1, 0};  // Keep
vec3 v4{-1, -1, 0}; // Remove
vec3 v5{1, 1, 0};   // Remove
std::vector vec{v1, v2, v3, v4, v5};
```
Test 1: non-order-conservative:

```cpp
// ... <print vec values>
remove_duplicates<key_equal, hasher, false>(vec);
// ... <print vec values>
```

Output (before / after):

```
(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, -1, 0) (0, 1, 0) (1, 1, 0)
```
Test 2: order-conservative:

```cpp
// ... <print vec values>
remove_duplicates<key_equal, hasher, true>(vec);
// ... <print vec values>
```

Output (before / after):

```
(1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
(1, 1, 0) (0, 1, 0) (1, -1, 0)
```
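For completeness, here is a minimal self-contained program reproducing Test 2 (the `print` helper and the `vec3` sketch are mine, for illustration only):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <unordered_set>
#include <vector>

struct vec3
{
    int x, y, z;
    bool operator==(const vec3& o) const { return x == o.x && y == o.y && z == o.z; }
    vec3 operator-() const { return {-x, -y, -z}; }
};

/// "Predicate" indicating whether two values are considered duplicates or not
struct key_equal
{
    bool operator()(const vec3& a, const vec3& b) const { return (a == b) || (-a == b); }
};

/// Hashes a vec3 by adding the absolute values of its components.
struct hasher
{
    std::size_t operator()(const vec3& v) const
    {
        return static_cast<std::size_t>(std::abs(v.x) + std::abs(v.y) + std::abs(v.z));
    }
};

template<typename key_equal_t, typename hasher_t, bool order_conservative, typename container_t>
void remove_duplicates(container_t& container)
{
    using value_type = typename container_t::value_type;
    if constexpr (order_conservative)
    {
        std::unordered_set<value_type, hasher_t, key_equal_t> s;
        const auto predicate = [&s](const value_type& value) { return !s.insert(value).second; };
        container.erase(std::remove_if(container.begin(), container.end(), predicate), container.end());
    }
    else
    {
        const std::unordered_set<value_type, hasher_t, key_equal_t> s(container.begin(), container.end());
        container.assign(s.begin(), s.end());
    }
}

void print(const std::vector<vec3>& vec)
{
    for (const vec3& v : vec)
        std::cout << '(' << v.x << ", " << v.y << ", " << v.z << ") ";
    std::cout << '\n';
}

int main()
{
    std::vector<vec3> vec{{1, 1, 0}, {0, 1, 0}, {1, -1, 0}, {-1, -1, 0}, {1, 1, 0}};
    print(vec);                                      // (1, 1, 0) (0, 1, 0) (1, -1, 0) (-1, -1, 0) (1, 1, 0)
    remove_duplicates<key_equal, hasher, true>(vec); // order-conservative variant
    print(vec);                                      // (1, 1, 0) (0, 1, 0) (1, -1, 0)
}
```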