I have a dataset from which I want to remove duplicates. Two records are duplicates if they have the same values in three fields, each stored as an int64.
I am using C++17. I want my code to be as fast as possible (memory is less of a constraint). I do not care about ordering. I know nothing about the distribution of the int64 values.
My idea is to use an unordered_set, with a hash of the three int64 values as the key.
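To make the idea concrete, here is a minimal sketch of what I have in mind. The names Key, KeyHash and dedup are made up for illustration, and the hash-combine step (in the style of Boost's hash_combine) is only a placeholder, not something I have tuned or measured. I am assuming the set should store the triple itself with a custom hasher rather than only the hash value, since two distinct triples could otherwise collide and be dropped as duplicates.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical record type: the three fields that define a duplicate.
struct Key {
    std::int64_t a, b, c;
    bool operator==(const Key& o) const {
        return a == o.a && b == o.b && c == o.c;
    }
};

// Placeholder hasher: mixes the three std::hash values with a
// hash_combine-style step. This is an assumption, not a tuned choice.
struct KeyHash {
    static void combine(std::size_t& seed, std::int64_t v) {
        seed ^= std::hash<std::int64_t>{}(v)
              + static_cast<std::size_t>(0x9e3779b97f4a7c15ULL)
              + (seed << 6) + (seed >> 2);
    }
    std::size_t operator()(const Key& k) const noexcept {
        std::size_t seed = 0;
        combine(seed, k.a);
        combine(seed, k.b);
        combine(seed, k.c);
        return seed;
    }
};

// Keeps the first occurrence of each triple; later duplicates are skipped.
std::vector<Key> dedup(const std::vector<Key>& in) {
    std::unordered_set<Key, KeyHash> seen;
    seen.reserve(in.size());  // avoid rehashing while inserting
    std::vector<Key> out;
    for (const Key& k : in) {
        if (seen.insert(k).second) {  // .second is true only for new keys
            out.push_back(k);
        }
    }
    return out;
}
```

Since memory is not my main constraint, I reserve space for the worst case (no duplicates at all) up front.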
Here are my questions:
- Is unordered_set the best option? How about a map?
- Which hash function should I use?
- Is it a good idea to put the three int64 values into a string and then hash that string?
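For the last question, this is roughly what I mean by the string approach. It is only a sketch: Triple, pack and dedup_via_string are names I made up, and I am assuming the raw bytes are copied into the string rather than formatted as text. Each key is 24 bytes, which I suspect is too long for the small-string optimization in common standard libraries, so every insert would probably cost a heap allocation.

```cpp
#include <array>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical record type for this sketch: three int64 fields.
using Triple = std::array<std::int64_t, 3>;

// Copies the raw bytes of the three values into a std::string
// (no text formatting), so equal triples give equal strings.
std::string pack(const Triple& t) {
    const std::size_t n = 3 * sizeof(std::int64_t);
    std::string s(n, '\0');
    std::memcpy(&s[0], t.data(), n);
    return s;
}

// Deduplicates by hashing the packed string with the default std::hash.
std::vector<Triple> dedup_via_string(const std::vector<Triple>& in) {
    std::unordered_set<std::string> seen;
    seen.reserve(in.size());
    std::vector<Triple> out;
    for (const Triple& t : in) {
        if (seen.insert(pack(t)).second) {  // true only for unseen keys
            out.push_back(t);
        }
    }
    return out;
}
```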
Thanks for your help.