
While working on performance-sensitive code, I read that HashSet should offer constant-time performance for its add, remove, contains and size methods, as long as the hash function disperses the elements properly among the buckets (see the HashSet Javadoc).

When executing the following code, I discovered that this constant-time performance of the add method does not hold. Doubling numberOfElements from 10,000,000 should roughly double the runtime, but it takes roughly 3-4 times as long.

HashSet<Integer> myHashSet;
int numberOfElements;

//Warm-up code
numberOfElements = 10000000; //10 million elements
myHashSet = new HashSet<>(540000000); //Create HashSet with a large enough initial capacity to avoid resize operations
for (int i = 0; i < numberOfElements; i++){
   myHashSet.add(i); //Add element
}

//First benchmark
numberOfElements = 5000000; //5 million elements
myHashSet.clear(); //Clear HashSet
long startTime = System.nanoTime(); //Save start time
for (int i = 0; i < numberOfElements; i++){
   myHashSet.add(i); //Add elements
}
System.out.print("Set filled with " + myHashSet.size() + " records in ");
System.out.println((System.nanoTime() - startTime)/1000000000 + "s");

// Second benchmark
numberOfElements = 10000000; //10 million elements
myHashSet.clear(); //Clear HashSet
startTime = System.nanoTime(); //Save start time
for (int i = 0; i < numberOfElements; i++){
   myHashSet.add(i); //Add elements
}
System.out.print("Set filled with " + myHashSet.size() + " records in ");
System.out.println((System.nanoTime() - startTime)/1000000000 + "s");

// Third benchmark
numberOfElements = 20000000; //20 million elements
myHashSet.clear(); //Clear HashSet
startTime = System.nanoTime(); //Save start time
for (int i = 0; i < numberOfElements; i++){
   myHashSet.add(i); //Add elements
}
System.out.print("Set filled with " + myHashSet.size() + " records in ");
System.out.println((System.nanoTime() - startTime)/1000000000 + "s");

// Fourth benchmark
numberOfElements = 40000000; //40 million elements
myHashSet.clear(); //Clear HashSet
startTime = System.nanoTime(); //Save start time
for (int i = 0; i < numberOfElements; i++){
   myHashSet.add(i); //Add elements
}
System.out.print("Set filled with " + myHashSet.size() + " records in ");
System.out.println((System.nanoTime() - startTime)/1000000000 + "s");

I'd really appreciate knowing the root cause of this behavior.

EDIT: I added the code to measure performance in the example above.
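
In case it helps, here is a rough, untested sketch of how the same measurement could be structured as a proper micro-benchmark with JMH (this assumes JMH is on the classpath; the class name HashSetAddBenchmark and the pre-sizing factor are placeholders I picked myself):

import java.util.HashSet;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
@Warmup(iterations = 5)
@Measurement(iterations = 5)
@Fork(1)
public class HashSetAddBenchmark {

    // Element counts to test; JMH runs the benchmark once per value
    @Param({"5000000", "10000000", "20000000", "40000000"})
    public int numberOfElements;

    @Benchmark
    public HashSet<Integer> fillSet() {
        // Pre-size the set so no resize happens during the measurement
        HashSet<Integer> set = new HashSet<>(numberOfElements * 2);
        for (int i = 0; i < numberOfElements; i++) {
            set.add(i);
        }
        return set; // return the set so the JIT cannot eliminate the loop
    }
}

(I haven't verified my numbers with this version yet; it would be run through the usual JMH runner.)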

  • @Michael: The code to measure run time performance was added in the post above. – Robert Sep 13 '17 at 15:39
  • Given how the JVM works and JIT compiles code, writing a micro-benchmark is a non-trivial exercise. The code you posted will not suffice, please read the linked duplicate target question. – Jim Garrison Sep 13 '17 at 15:50
  • @JimGarrison: Thanks for the hint about microbenchmarking. To keep things simple I've improved the code for microbenchmarking (used nanoTime() instead of currentTimeMillis(), introduced a warm-up phase and added some iterations to measure run-time). It seems the code still does not scale linearly. Running it with 5, 10, 20 and 40 million elements for the HashSet gives me run-times of 3 s, 6 s, 16 s and 39 s (instead of 3, 6, 12, 24 as expected). Is this effect still due to micro-benchmarking issues? – Robert Sep 13 '17 at 15:56
  • Also consider that in a real world scenario on real hardware, idealistic theoretical runtimes are not perfect. As you scale up, there are many more objects being allocated on the heap for the GC to deal with, and a collection that is too large could run into problems fitting into cache increasing cache misses. – puhlen Sep 13 '17 at 16:24
  • Code has been updated now in the question above. Thanks for your helpful comments! – Robert Sep 13 '17 at 16:35
  • Have you heard of a function? :P All that copy-pasting makes it very hard to read. What are your results? Why are you subtracting nanoseconds from milliseconds? – Michael Sep 13 '17 at 16:42
  • As Robert said, your big collection doesn't fit into cache, which easily explains the slowdown (there are actually multiple caches of different sizes and there are orders of magnitude between the fastest cache speed and memory speed). It's not the `HashMap`'s fault (though with an `ArrayList`, it may behave differently because of the sequential access). – maaartinus Sep 14 '17 at 11:12

0 Answers