5

What is the purpose of 3rd parameter KeyEqual in std::unordered_set? Isn't hash uniqueness enough?

template<
    class Key,
    class Hash = std::hash<Key>,
    class KeyEqual = std::equal_to<Key>,
    class Allocator = std::allocator<Key>
> class unordered_set;

Sorry if this question sounds naive. Moving to C++ from Python/PHP :)

For now my implementations of KeyEqual always duplicates Hash impl. So I was wonder if I do it correctly.

gsamaras
  • 71,951
  • 46
  • 188
  • 305
spajak
  • 553
  • 1
  • 9
  • 19
  • 3
    Never heard of hash collisions? If two objects produce the same hash, then that equality predicate will be used to determine whether they compare equal. – Praetorian Aug 23 '16 at 17:36
  • 1
    Hash uniqueness enough? What if your key is `int` and your hash function is `[](int i){ return i % 10; }`? – Ryan Haining Aug 23 '16 at 17:36
  • What is wrong with documentation of [`unordered_set`](http://www.cplusplus.com/reference/unordered_set/unordered_set/)? hash collisions from previous comments is reason N 1, but nearly all stl container allow customization of comparisson operation. What if yu need YOUR way to compare keys or you use a key type without comparisson operator pre-existing? – mvidelgauz Aug 23 '16 at 17:38
  • then it should be replaced, like table[0] = "foo" replaces 1st element in the table – spajak Aug 23 '16 at 17:39
  • 3
    @mvidelgauz FYI, cplusplus.com has some [problems](http://stackoverflow.com/questions/6520052/whats-wrong-with-cplusplus-com). Use [cppreference](http://en.cppreference.com/w/cpp/container/unordered_set) instead. – jaggedSpire Aug 23 '16 at 17:39
  • @spajak that breaks the guarantee of the set. – jaggedSpire Aug 23 '16 at 17:40
  • @jaggedSpire it may be unordered_map in the question as well – spajak Aug 23 '16 at 17:43
  • @jaggedSpire Thank you for this information. I didn't know about problems, will read about it, but I usualy use both (+ some others in some cases) – mvidelgauz Aug 23 '16 at 17:44
  • @spajak `unordered_set`, conceptually, is also a set. – jaggedSpire Aug 23 '16 at 17:45
  • specifically, when you add an element to a dynamic set, the resulting set should have the union of all the previous members of the set, and a set containing only the new element: previous elements should *not* be removed. If an *identical* element to the added element already existed, the resulting set would be unchanged. What you're proposing would make the result of {2,5,7,3} add {13} {2,5,7,13}, instead of the correct result of {2,3,7,3,13}, if the hash was i % 10. – jaggedSpire Aug 23 '16 at 17:51
  • Ultimately, hashing a value is a quick way to find out where, approximately, it should go in the set. It's much like the zip code in an address--it doesn't uniquely identify the value. It *does* get you close enough to be able to exhaustively search candidates for matches without an insane investment of time. In the zip code analogy, if you were looking for an address you could at that point search the nearby houses for the address you wanted, without spending more then, say, a day on the task, instead of the unimaginable amount of time required to look through all the houses in a country. – jaggedSpire Aug 23 '16 at 17:58
  • @spajak: "Isn't hash uniqueness enough"? Hash uniqueness would definitely be enough. But there's no hash uniqueness in general case. Moreover, the whole idea if hash-based data structures revolves around the issue of dealing with hash collisions efficiently. – AnT stands with Russia Aug 23 '16 at 18:10

4 Answers4

6

Let's take for example, a set of ints with a hash function that just does a simple mod % operation

struct IntMod {
    constexpr std::size_t operator()(int i) const { return i % 10; }
};

std::unordered_set<int, IntMod> s;

This can easily result in a hash collision, and when that happens you need to be able to compare the keys to know if a key is already present.

s.insert(25);  // hash == 5
s.insert(35);  // hash == 5
assert(*s.find(25) == 25);  // both 25 and 35 are present despite the same hash
assert(*s.find(35) == 35);

If we add a KeyEqual that just uses the hash function as well (like you suggest it do by default), it breaks on the second insertion.

struct IntEq {
  constexpr bool operator()(int a, int b) const {
    return IntMod{}(a) == IntMod{}(b);
  }
};

std::unordered_set<int, IntMod, IntEq> s;
s.insert(25);  // hash == 5
s.insert(35);  // hash == 5
assert(*s.find(25) == 25);
assert(*s.find(35) == 35);  // now this fails. s.find(35) returns iterator to 25
Ryan Haining
  • 35,360
  • 15
  • 114
  • 174
5

But what if we have a hash collision?

enter image description here

The picture demonstrates the case where two different elements, happen to have equal hash values. As a result, the hash value may not be unique when it comes to hashing.


Quoting the ref of std::unordered_set:

Internally, the elements in the unordered_set are not sorted in any particular order, but organized into buckets depending on their hash values to allow for fast access to individual elements directly by their values (with a constant average time complexity on average).

So a bucket can have more than one element! These two elements will have the same hash value, which is not guaranteed to be unique!


The only thing that is guaranteed to be unique is the key!

gsamaras
  • 71,951
  • 46
  • 188
  • 305
  • I understand collisions. But what if I dont care about collisions? What if I want to simply say "My Hash is 100% unique"; if I try to set the same element give me exception or replace this element? Mybe its for another question.. but.. – spajak Aug 23 '16 at 17:59
  • Yes @spajak it's for another question. The container will not do that for you out of the box, since it's not designed like that. I mean the hashing policy that it uses allows collisions, that's why it needs the `key` to determine equality. You will have to create your own container (that will inherit from the `STL` one) and will throw an exception if desired. I edited your question, hope you don't mind. :) – gsamaras Aug 23 '16 at 18:01
2

Different values do not necessarily have different hashes. For example, there are a practically infinite number of std::string objects, but only 2^N std::size_t objects for the result of std::hash<std::string>()(s), so some "hash collisions" are inevitable, though the algorithm tries to make them unlikely.

Therefore std::unordered_set and std::unordered_map need a way to test whether elements are equal or unequal, even if their hash values are equal.

aschepler
  • 70,891
  • 9
  • 107
  • 161
1

Quite simply, the set needs to know whether two keys are equal. KeyEqual is the mechanism for doing that.

Bear in mind that two keys that don't compare equal could hash to the same value, and the set needs to be able to check for that.

NPE
  • 486,780
  • 108
  • 951
  • 1,012