2

Is this sentence from an OCP book correct?

A HashSet stores its elements in a hash table, which means the keys are a hash and the values are an Object.

I have created below code snippet:

class A
{
    private String s;
    public static void main(String[] args) throws NoSuchFieldException, IllegalAccessException  {
        HashSet<A> elems =  new HashSet<>(Set.of(new A("abc"), new A("cba"), new A("a")));
        Field field = elems.getClass().getDeclaredField("map");
        field.setAccessible(true);
        HashMap privateMap = (HashMap) field.get(elems);
    }

    public A(String s)
    {
        this.s = s;
    }
}

And as a result I retrieved the following HashMap:
enter image description here

It looks like it's the object itself being a key in the HashMap instead of its hash being the key. Is there an error in the book or am I looking at something irrelevant?

Mark Rotteveel
  • 100,966
  • 191
  • 140
  • 197
bridgemnc
  • 189
  • 1
  • 2
  • 11
  • Why wonder when you have [the source code for `HashSet`](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/HashSet.java) available? :) – Federico klez Culloca May 02 '23 at 13:07
  • Have a look what happens when you have two keys with the same hash, e.g. [`FB` and `Ea`](https://stackoverflow.com/a/10102397/3788176). – Andy Turner May 02 '23 at 13:09
  • 3
    I'd say the book is wrong, or it is oversimplifying things (given you can have multiple objects with the exact same hash in a HashSet). It is probably talking about the fact the hash is used to index into the hash table, where you can then find the objects. BTW: It is more relevant to look at the source of [`HashMap`](https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/HashMap.java) than of `HashSet`. – Mark Rotteveel May 02 '23 at 13:10

2 Answers2

4

Of course it is the object and not the hash. Some facts which are individually obvious or at least easy to verify:

  • Given that hashes are fixed size (2^32 different hashes exist, that's it though), and there are objects such as Strings for which more than 2^32 different values exist (An infinite amount of them), it is a mathematical certainty that different strings exist that hash to the same value. This is called the pigeon hole principle if you prefer to watch some videos or some such to explain this concept in more detail.
  • It would be incredibly annoying if HashMap/HashSet considers different strings as nevertheless equal just because their hash codes so happen to collide. In fact, that would mean these types aren't conforming to their own specification (javadoc), because they clearly state collisions will slow things down but aren't otherwise a problem.
  • Hence, HashSet/Map needs the original object so that it can use .equals() to verify that things are actually equal.

The way hashmap/set works is as follows:

  • Use the hashCode of an object to find the 'bucket' to look in.
  • Then just scan, one at a time, each object in that bucket. Yes, this is slow - hence you don't want a ton of hash collisions.

And thus this concludes:

  • Given two objects a and b, if a.equals(b) is true, then a.hashCode() == b.hashCode() HAS to be true, or you can't use these objects in hashset/hashmap because bizarre things will happen if you do, with keys/objects you just added disappearing into thin air but still showing up when iterating through, and so on.
  • The reverse DOES NOT HOLD - If a.hashCode() == b.hashCode() is true, then a.equals(b) doesn't have to be. It could just be a collision which is fine.
  • That means theoretically @Override public int hashCode() { return 1; } is valid. Indeed it is; HashMap and HashSet will operate as described, as long as the equals method is working fine. However, such maps/sets will be O(n) - their performance is linear to how many items are in there. Which is unexpected; the point of hashset/map is that their performance doesn't meaningfully change as they grow. HashMap/Set aren't magic - they need a good hashCode() impl to fulfill their performance promises.

If you read some docs or books that say the hashcode is the key, then that's either from a section that is oversimplifying (accent on the over - that's probably not a good thing to put in a tutorial or explanation), or the author misunderstands how this stuff works, or its a description for hashmap impl in another system/language which takes the highly dubious shortcut of equating 'equal hashes' to '.. thus equal'. Given Pigeonhole Principle and all that, this is a bad idea. If you MUST do things that way, use a good hash algo and way more bits than 32!

rzwitserloot
  • 85,357
  • 5
  • 51
  • 72
  • 1
    This description is also a simplification; the hash map implementation will try to solve hash collisions by resorting to a binary tree if there are too many of them in a single bucket. That works for strings as they are comparable. – Holger May 03 '23 at 16:37
1

In simple terms what actually happens is the following:

A HashSet is backed by a HashMap meaning behind the scenes the HashSet uses a HashMap data structure.

Whatever value you add to the HashSet is added as a key to the HashMap which exists behind the scenes. Since HashMap does not allow duplicate keys, during several insertions of the same key, only the latest key is the one that survives. Therefore it fulfils the need that a HashSet contains only unique values.

For all above insertions in a HashSet, the value that you insert in HashSet is inserted as a key in the HashMap. At the same time as a value to the HashMap undisclosed to you is inserted always the same value for the specific HashMap instance from the line static final Object PRESENT = new Object() -> reference. . Therefore all values which are inserted behind the scenes to a HashMap backing up a specific HashSet are always a specific reference created from new Object(). For another HashSet there would be a different value, but for the same HashSet it always use the same reference.

Having said the above, to answer the title of the question

What is the key in the Set's underlying HashMap: hash code or the object itself?

The key in underlying HashMap would be the reference to the actual object which was inserted as a value to the HashSet. Now how the HashMap stores this reference in the HashMap data structure is that it calculates the hashcode of that object to identify the bucket of HashTable... (read more about hashmap to track it further how hashMap stores elements)

Panagiotis Bougioukos
  • 15,955
  • 2
  • 30
  • 47