
I am dealing with a dataset in which I want to remove duplicates. Duplicates are defined by having the same value for three fields stored as int64.

I am using C++17. I want my code to be as fast as possible (memory is less of a constraint). I do not care about ordering. I know nothing about the distribution of the int64 values.

My idea is to use an unordered_set with a hash of the three int64 as a key.

Here are my questions:

  1. Is the unordered_set the best option? How about a map?
  2. Which hash function should I use?
  3. Is it a good idea to put the three int64 into a string then hash that string?

Thanks for your help.

olivier
  • Put them in a `std::vector<std::array<int64_t, 3>>`, sort the vector, then apply `std::unique` – doug Feb 01 '22 at 23:39
  • Possibly relevant: [How to use unordered_set with custom types?](https://stackoverflow.com/questions/9729390/how-to-use-unordered-set-with-custom-types) – Wyck Feb 02 '22 at 05:27

3 Answers


I would use:

std::unordered_map<uint64_t, std::unordered_map<uint64_t, std::unordered_set<uint64_t>>>
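
For deduplication, usage could then look roughly like this (a sketch; `insert_if_new` is just an illustrative name):

#include <cstdint>
#include <unordered_map>
#include <unordered_set>

using Nested = std::unordered_map<uint64_t,
    std::unordered_map<uint64_t, std::unordered_set<uint64_t>>>;

// insert() returns {iterator, bool}; the bool is true only for a new triple
bool insert_if_new(Nested& seen, uint64_t a, uint64_t b, uint64_t c)
{
    return seen[a][b].insert(c).second;
}
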
  1. Is the unordered_set the best option? How about a map?

Anything unordered_ will (I believe) use hash tables; the ordered containers use some kind of balanced binary tree.

  2. Which hash function should I use?

Whatever std::hash<uint64_t> provides, unless you have a reason to believe that you can do better.

  3. Is it a good idea to put the three int64 into a string then hash that string?

What can you do with strings that you can't do with integers? Building and hashing a string will most likely just take longer...
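
If you do want a single flat unordered_set keyed directly on all three integers, a rough sketch (illustrative, untested) is a small struct with its own hasher, no strings involved:

#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

struct Key {
    int64_t a, b, c;
    bool operator==(const Key& o) const { return a == o.a && b == o.b && c == o.c; }
};

struct KeyHash {
    std::size_t operator()(const Key& k) const {
        // boost-style combining of the three per-field hashes
        std::size_t h = std::hash<int64_t>{}(k.a);
        h ^= std::hash<int64_t>{}(k.b) + 0x9e3779b9 + (h << 6) + (h >> 2);
        h ^= std::hash<int64_t>{}(k.c) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

// std::unordered_set<Key, KeyHash> seen; seen.insert({a, b, c}).second
// is true only when the triple is new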

Vlad Feinstein

Thanks, Vlad!

I also found an interesting implementation here: How do I combine hash values in C++0x?

#include <cstddef>
#include <functional>

// recursion terminator
inline void hash_combine(std::size_t& seed) { }

// boost-style combiner: folds each argument's hash into the seed
template <typename T, typename... Rest>
inline void hash_combine(std::size_t& seed, const T& v, const Rest&... rest) {
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    hash_combine(seed, rest...);
}

Usage:

std::size_t h=0;
hash_combine(h, obj1, obj2, obj3);

which I then plug into a flat std::unordered_set<std::size_t>. (Note that this stores only the combined hashes, so two distinct triples that happen to hash to the same value would be wrongly treated as duplicates.)
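
Concretely (a sketch, with a, b, c standing for the three fields):

std::unordered_set<std::size_t> seen;

std::size_t h = 0;
hash_combine(h, a, b, c);
bool is_duplicate = !seen.insert(h).second;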

This is roughly 2.5x faster than your proposal on my machine. On the other hand, your solution is simpler to read and does not require a hand-crafted hasher.

And then this solution is even faster:

#include <cstdint>
#include <functional>
#include <unordered_set>
#include <utility>

// pair needs a custom hasher; contains() is C++20, so use insert()'s bool
struct pair_hash {
    std::size_t operator()(const std::pair<int64_t, int64_t>& p) const {
        std::size_t h = std::hash<int64_t>{}(p.first);
        return h ^ (std::hash<int64_t>{}(p.second) + 0x9e3779b9 + (h << 6) + (h >> 2));
    }
};
bool insert_key_and_exists(int64_t a, int64_t b,
                           std::unordered_set<std::pair<int64_t, int64_t>, pair_hash>* dict)
{
    return !dict->insert({a, b}).second;
}
olivier

Hash tables such as the one behind unordered_map are excellent for large objects, especially when the key is small relative to the value. But for this case you would probably get the best speed from a plain vector: sort it, then run std::unique. Sorting is O(n log n), but it is cache-friendly and avoids hashing and per-node allocation entirely.

#include <cstdint>
#include <vector>
#include <array>
#include <algorithm>
#include <iostream>

// sort and remove duplicates
void remove_dups(std::vector<std::array<int64_t, 3>>& v)
{
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
}

int main()
{
    std::vector<std::array<int64_t, 3>> v{ {1,2,3}, {3,2,1}, {1,2,3}, {2,3,4} };
    remove_dups(v);
    for (const auto& i3 : v)
    {
        for (const auto& i : i3)
            std::cout << i << " ";
        std::cout << '\n';
    }
}
doug