
OK, this might be a super-stupid question, but I'm a bit baffled at the moment and eager to hear what you can tell me about this.

I had an ArrayList with about 5 million longs in it. These longs are hashes calculated from the primary keys (concatenated Strings) of a big CSV file.

Now I wanted to check for uniqueness, looping through the list like this:

for (int i = 0; i < hashArrayList.size(); i++) {
    long refValue = hashArrayList.get(i);
    for (int j = i + 1; j < hashArrayList.size(); j++) {
        if (refValue == hashArrayList.get(j)) {
            // --> UNIQUENESS VIOLATION, now EXPLODE!!
        }
    }
}

This way it takes HOURS.

Now about the HashSet, which doesn't allow duplicates by itself. A hashSet.addAll(hashArrayList) takes 4 seconds for this list with 5 million elements, while eliminating/not adding duplicates!

How does it do that? And: Is my ArrayList-looping so stupid?
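For reference, the HashSet variant is nothing more than this (a rough sketch of what I ran):

HashSet<Long> uniqueHashes = new HashSet<>();
uniqueHashes.addAll(hashArrayList); // ~4 seconds; duplicates are silently dropped

if (uniqueHashes.size() < hashArrayList.size()) {
    // the list contained at least one duplicate
}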

Toni Kanoni

3 Answers

You are doing a totally different comparison.

With an ArrayList, you have a nested for loop which makes it O(n^2).

But with a HashSet, you are not doing any looping; you are just adding n elements to it, which is O(n). Internally, a HashSet uses a HashMap whose keys are the individual elements of the list and whose values are all the same static Object.

Source code for HashSet (Java 8)

public HashSet(Collection<? extends E> c) {
    map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
    addAll(c);
}

addAll calls add

public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}

So, ultimately it all comes down to inserting an object (here a Long, since the primitive long gets autoboxed) into a HashMap, which provides constant-time performance [1].


[1] From the javadoc of HashMap (emphasis mine):

This implementation provides constant-time performance for the basic operations (get and put), assuming the hash function disperses the elements properly among the buckets
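To make the contrast concrete, here is a minimal sketch (variable names are illustrative) of the same uniqueness check done in a single O(n) pass, using the return value of add:

Set<Long> seen = new HashSet<>(hashArrayList.size() * 2);
for (long hash : hashArrayList) {
    if (!seen.add(hash)) {
        // add returns false if the value was already present
        // --> UNIQUENESS VIOLATION for this hash
    }
}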

Thiyagu

A hash-based collection doesn't need looping to check whether there are elements with the same key.

Imagine you have 1,000 objects. In your case, you loop through the whole list every time you add something.

A hash-based collection calculates the hash of the object, looks up whether there are other elements with the same hash, and then only needs to check whether one of those is equal to the new element. If you have a good hash function that returns unique hashes for unique elements, you just have to calculate a number.

Of course, if you just say "I am lazy and I override my hashCode method with return 1", then you get the same number of comparisons, in addition to the overhead of the hash structure itself.
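A minimal sketch of that degenerate case (the class name is made up for illustration):

// Every instance hashes to the same bucket, so the HashSet
// degrades to a linear scan over equals() for each insertion.
class LazyKey {
    final long value;

    LazyKey(long value) { this.value = value; }

    @Override
    public int hashCode() { return 1; } // "I am lazy": constant hash

    @Override
    public boolean equals(Object o) {
        return o instanceof LazyKey && ((LazyKey) o).value == value;
    }
}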

Example: Imagine you have the following HashSet:

HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]

As you can see, the basic structure can be like this: an array containing other data structures with the actual entries. Now, if you put an obj5 into the HashSet, it will call obj5.hashCode(). Based on that, it calculates the outer index for this object. Let's say it's 4:

HashSet: [[obj1], [null], [null], [null], [obj2, obj3, obj4]]
                                                  ^ obj5

Now we have three other objects with the same index. Yes, we need a loop here to check whether any of them is equal to the new obj5, but in a bigger HashSet with millions of entries, comparing against a few elements is much faster than comparing against all of them. This is the advantage of a hash-based collection.
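For the curious, here is a simplified sketch of how a hash code is turned into such an index; java.util.HashMap (Java 8) does essentially this, though the helper name here is made up and the real implementation also treeifies crowded buckets:

// capacity is always a power of two in HashMap,
// so (capacity - 1) acts as a bit mask over the low bits
static int bucketIndex(Object key, int capacity) {
    int h = key.hashCode();
    h ^= (h >>> 16);           // spread high bits into the low bits
    return h & (capacity - 1); // cheaper equivalent of h % capacity
}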

Shryne
  • Ok, thanks. But what I still don't quite get is: even if the set contains only hashes of objects, you'd still need to compare an incoming new hash with each of the already existing hashes, every time, no? So, still a loop over everything as soon as something's added? – Toni Kanoni Jun 05 '18 at 11:24
  • I added an example. As you can see, a hash translates into the index (not exactly: if your hashCode returns 1,000,000, that won't literally be the index) and thus doesn't need to be compared with all the other hashes. – Shryne Jun 05 '18 at 11:59

HashMap internal working

Moreover, you are using a loop inside a loop, which makes the complexity O(n^2) and is less efficient than what a HashMap uses.

Arnab Dhar