
I'm tempted to think that the HashSet.contains(Object) method runs in constant time: it simply gets the hash code of an object and then looks it up in a hash table.

First, could someone please confirm whether this is true?

Second, if it is true, is there any risk of collisions, where two objects might have the same hash code and thus the HashSet thinks it has both when it only has one?

swaplink
ktm5124
  • In theory, at least, you are correct. All sorts of caveats for hash synonyms and table overflows, though. – Hot Licks Aug 11 '14 at 16:21
  • The time complexity of contains is the same as get. See the answer here: http://stackoverflow.com/questions/4553624/hashmap-get-put-complexity for further discussion – Alcanzar Aug 11 '14 at 16:22

2 Answers


It runs in O(1) expected time, as does any hash table (assuming the hash function is decent). A HashSet is backed by a HashMap in which the set's elements are the keys.

Two objects might have the same hash code, but the HashSet wouldn't consider them identical unless the equals method for these objects says they are the same (i.e. returns true).
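
For instance, here is a minimal sketch (the CollidingKey class is hypothetical, written only for this illustration): all instances share one hash code, yet the set stores both objects and contains answers correctly for each.

import java.util.HashSet;
import java.util.Objects;

class CollidingKey {
    private final String name;

    CollidingKey(String name) { this.name = name; }

    @Override
    public int hashCode() {
        return 42; // deliberately identical for every instance
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof CollidingKey
                && Objects.equals(name, ((CollidingKey) o).name);
    }
}

public class CollisionDemo {
    public static void main(String[] args) {
        HashSet<CollidingKey> set = new HashSet<>();
        set.add(new CollidingKey("a"));
        set.add(new CollidingKey("b"));

        // Same hash code, but equals tells them apart: both are stored.
        System.out.println(set.size());                          // 2
        System.out.println(set.contains(new CollidingKey("a"))); // true
        System.out.println(set.contains(new CollidingKey("c"))); // false
    }
}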

The contains method calls (indirectly) getEntry of HashMap, where key is the Object whose presence in the HashSet you want to test.

As you can see below, two objects can be stored in the HashMap/HashSet even if their keys are mapped to the same bucket by the hash function. The method iterates over all entries whose keys have the same hash value and performs equals on each one to find the matching key.

final Entry<K,V> getEntry(Object key) {
    // Compute the (spread) hash of the key; null keys hash to 0.
    int hash = (key == null) ? 0 : hash(key.hashCode());
    // Walk the chain of entries in the bucket this hash maps to.
    for (Entry<K,V> e = table[indexFor(hash, table.length)];
         e != null;
         e = e.next) {
        Object k;
        // Compare the cached hash first, then check reference
        // equality or equals() to find the matching key.
        if (e.hash == hash &&
            ((k = e.key) == key || (key != null && key.equals(k))))
            return e;
    }
    return null;
}
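
To see the same mechanism from the outside, here is a small demo against the real JDK ("Aa" and "BB" are a well-known java.lang.String hash collision, both hashing to 2112):

import java.util.HashSet;

public class StringCollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" share a hash code, so both land in the same
        // bucket and getEntry walks the chain, relying on equals to
        // pick the right entry.
        HashSet<String> set = new HashSet<>();
        set.add("Aa");
        set.add("BB");

        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
        System.out.println(set.size());          // 2
        System.out.println(set.contains("Aa"));  // true
        System.out.println(set.contains("BB"));  // true
    }
}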
Mark Rotteveel
Eran
  • "The method iterates over all keys that have the same hash value" so it is not `O(1)`, is it? – Ali Lotfi Jun 23 '16 at 10:20
  • @AliLotfi The `expected` time is O(1), since the average number of keys in each bucket of the HashSet is bounded by a small constant. At the worst case (if all the keys are mapped to the same bucket), the search would take linear time, but unless you have a terrible hashCode method, the worst case is not expected to ever happen. – Eran Jun 23 '16 at 10:25
  • If your hash function is that bad, you have bigger problems – ccook Feb 14 '20 at 16:00
  • The worst-case complexity is the max of O(1) and the cost of equals. For example, a set of strings will have a worst-case time complexity of O(W) for a query string of length W. – Saurabh Kumar May 26 '20 at 07:45

The worst-case performance of contains is O(log n) for Java 8 and O(n) for Java 7, but the average case is closer to O(1). This is because the HashSet is backed by a HashMap and thus has the same efficiency as a HashMap lookup (i.e., HashMap.get(...)). The actual mapping in a HashMap is constant time (O(1)), but the need to handle collisions brings the cost to O(log n): multiple elements that hash to the same array index must be stored in a secondary data structure (a.k.a. a bucket), and it is this bucket that determines the worst-case performance. Since Java 8, HashMap handles heavily colliding buckets with a self-balancing tree.

Self-balancing trees guarantee O(log n) for all operations, so insertion and lookup in a HashMap (and HashSet) have a total cost of O(1) + O(log n) = O(log n). The use of a self-balancing tree for collision handling was introduced in Java 8 as an improvement over chaining (used through Java 7), which uses a linked list and has a worst case of O(n) for lookup and insertion (since it needs to traverse the list). Note that plain chaining would give constant-time insertion (as opposed to lookup), since elements can be appended to a linked list in O(1); however, the set property (no duplicates) is imposed on the linked list in the case of HashMap, so it must also traverse the list on insertion to ensure the element does not already exist in the bucket, and we end up with O(n) for both insertion and lookup.
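
A minimal sketch of that degenerate case (the DegenerateKey class is hypothetical, written only for this illustration): every key hashes to the same bucket, so lookups are served by the bucket's secondary structure rather than by the hash table itself.

import java.util.HashSet;

class DegenerateKey implements Comparable<DegenerateKey> {
    final int id;

    DegenerateKey(int id) { this.id = id; }

    @Override
    public int hashCode() { return 0; } // every instance collides

    @Override
    public boolean equals(Object o) {
        return o instanceof DegenerateKey && ((DegenerateKey) o).id == id;
    }

    // Implementing Comparable lets the Java 8 tree bin order the keys
    // by comparison instead of falling back to identity-based tie-breaking.
    @Override
    public int compareTo(DegenerateKey other) {
        return Integer.compare(id, other.id);
    }
}

public class WorstCaseDemo {
    public static void main(String[] args) {
        HashSet<DegenerateKey> set = new HashSet<>();
        for (int i = 0; i < 100_000; i++) {
            set.add(new DegenerateKey(i));
        }
        // All entries share one bucket. On Java 8+ the bucket is converted
        // to a red-black tree once it exceeds a threshold (8 entries), so
        // this lookup is O(log n); on Java 7 the bucket is a linked list
        // and the lookup is O(n).
        System.out.println(set.contains(new DegenerateKey(99_999))); // true
    }
}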

References:

This class implements the Set interface, backed by a hash table (actually a HashMap instance). https://docs.oracle.com/javase/8/docs/api/java/util/HashSet.html

Buckets containing a large number of colliding keys will store their entries in a balanced tree instead of a linked list after certain threshold is reached. (https://www.nagarro.com/en/blog/post/24/performance-improvement-for-hashmap-in-java-8)

Daniel Valland
  • I'd love to see the code of the benchmarks which led to this decision. It seems to be a net pessimisation. I've got severe doubts whether the benchmarks they show in their blog post hold up when setting a proper load factor, or against double hashing (but on the other hand I can't imagine them not considering this; in the end, I remain puzzled). – Konrad Rudolph Aug 21 '18 at 14:13
  • @KonradRudolph Note that the linked list is "swapped" for a tree only in the case of heavy collisions, which could be the result of custom overrides of the hashCode method. I think it is justified. If it is "pessimisation" as you say, then, if rather we assume the average case, a low number of collisions is expected, and chaining is used. Even if it did not use a threshold, in the average case the size of any given bucket would remain low, hence I doubt the extra comparisons needed for a tree vs a list would make much difference. However, in the case of many elements, the benefit of a tree is clear. – Daniel Valland Aug 21 '18 at 18:42