Background
This research started from a Data Structures and Algorithms lecture. I apologize for the long background, but it is necessary to understand the question.
The partitioning algorithms Lomuto and Hoare were benchmarked to see whether their running times correspond to what the theory predicts. Analysis shows that the number of swaps they perform, as a function of the input array length n, is approximately:
- Lomuto: n/2
- Hoare: n/4, if big and small numbers are equally distributed in the array
- Hoare: n/6, if big and small numbers are not equally distributed in the array
Swapping is the most expensive operation in these algorithms, so it is a good estimate of how they compare in terms of running time. This means that if the array is random, with big and small numbers equally distributed, and Hoare has a running time of 20 micros, Lomuto should take approximately 40 micros, i.e. 2 times slower.
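For reference, the question does not show the partition implementations themselves. A sketch of the standard textbook versions (the question's `Partitioning.partitionHoare` and `Partitioning.partitionLomuto` may differ in detail), instrumented with a swap counter so the n/2 vs. n/4 estimate can be checked empirically:

```java
import java.util.Random;

public class PartitionSketch {
    static long swaps = 0;

    static void swap(int[] A, int i, int j) {
        int t = A[i]; A[i] = A[j]; A[j] = t;
        swaps++; // count every swap, including i == j self-swaps
    }

    // Lomuto scheme: pivot is the last element; returns its final index.
    // Performs one swap per element <= pivot, ~n/2 on random input.
    static int partitionLomuto(int[] A) {
        int pivot = A[A.length - 1], i = -1;
        for (int j = 0; j < A.length - 1; j++)
            if (A[j] <= pivot) swap(A, ++i, j);
        swap(A, i + 1, A.length - 1);
        return i + 1;
    }

    // Hoare scheme: pivot is the first element; returns the split point j
    // such that A[0..j] <= pivot <= A[j+1..]. Only swaps misplaced pairs,
    // ~n/4 swaps on random input.
    static int partitionHoare(int[] A) {
        int pivot = A[0], i = -1, j = A.length;
        while (true) {
            do { i++; } while (A[i] < pivot);
            do { j--; } while (A[j] > pivot);
            if (i >= j) return j;
            swap(A, i, j);
        }
    }

    public static void main(String[] args) {
        int n = 10_000;
        int[] a = new Random(42).ints(n).toArray();
        swaps = 0;
        partitionLomuto(a.clone());
        System.out.printf("Lomuto swaps: %d (theory ~ n/2)%n", swaps);
        swaps = 0;
        partitionHoare(a.clone());
        System.out.printf("Hoare swaps: %d (theory ~ n/4)%n", swaps);
    }
}
```

Note that the swap count of a single partition call depends on the rank of the randomly chosen pivot, so the n/2 and n/4 figures are averages over many random arrays, not per-call guarantees.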
The microbenchmark:
public static double averageRuntimeTests(Function<int[], Integer> algorithm, int size,
        int warmupIterations, int measureIterations, int repeatedExecs) {
    long start, end;
    int result = -1;
    double[] runningTimes = new double[measureIterations];
    // Warmup
    for (int i = 0; i < warmupIterations; i++) {
        int[] A = randomArray(size);
        for (int j = 0; j < repeatedExecs; j++)
            result = algorithm.apply(A);
        System.out.print(result); // consume the result so the JIT cannot eliminate the call
    }
    // Measure
    for (int i = 0; i < measureIterations; i++) {
        int[] A = randomArray(size);
        start = System.nanoTime();
        for (int j = 0; j < repeatedExecs; j++)
            result = algorithm.apply(A);
        end = System.nanoTime();
        System.out.print(result); // dead-code-elimination guard
        runningTimes[i] = (end - start) / (repeatedExecs * 1000.0); // ns -> us per call
    }
    return average(runningTimes);
}
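The helpers `randomArray` and `average` used by the benchmark are not shown in the question; presumably they are something like the following sketch (the names and signatures are taken from the call sites above, the bodies are assumptions):

```java
import java.util.Random;

public class BenchHelpers {
    private static final Random RND = new Random();

    // Fresh array of uniformly random ints, so each iteration gets new input.
    static int[] randomArray(int size) {
        int[] A = new int[size];
        for (int i = 0; i < size; i++)
            A[i] = RND.nextInt();
        return A;
    }

    // Arithmetic mean of the measured per-call times.
    static double average(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static void main(String[] args) {
        System.out.println(average(new double[] {1.0, 2.0, 3.0})); // prints 2.0
    }
}
```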
The microbenchmark agrees with the theory if the algorithms are run sufficiently many times on the same input array, so that the JIT compiler "has enough time" to optimize the code. See the following code, where the algorithms are run 30 times on each different array.
>>> case 1
public static void main(String[] args) {
    int size = 10_000, warmupIterations = 10_000, measureIterations = 2_000, repeatedExecs = 30;
    System.out.printf("input size = %d%n", size);
    System.out.printf("#warmup iterations = %d%n", warmupIterations);
    System.out.printf("#measure iterations = %d%n", measureIterations);
    System.out.printf("#repeated executions = %d%n", repeatedExecs);
    double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
            measureIterations, repeatedExecs);
    double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
            measureIterations, repeatedExecs);
    System.out.printf("%nHoare: %f us/op%n", timeHoare);
    System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}
Result:
- Lomuto: 7.94 us/op
- Hoare: 3.7685 us/op
Lomuto is approximately 2 times slower than Hoare, as expected for a uniformly distributed array.
Result when Lomuto runs before Hoare:
- Lomuto: 13.513 us/op
- Hoare: 3.5865 us/op
Lomuto is now almost 4 times slower than Hoare, more than it should be. For some reason it takes almost double the time when it runs before Hoare.
Problem
However, if the algorithms run only a single time on each different input array, the JIT compiler behaves unexpectedly.
>>> case 2
public static void main(String[] args) {
    int size = 10_000, warmupIterations = 300_000, measureIterations = 60_000, repeatedExecs = 1;
    System.out.printf("input size = %d%n", size);
    System.out.printf("#warmup iterations = %d%n", warmupIterations);
    System.out.printf("#measure iterations = %d%n", measureIterations);
    System.out.printf("#repeated executions = %d%n", repeatedExecs);
    double timeHoare = averageRuntimeTests(Partitioning::partitionHoare, size, warmupIterations,
            measureIterations, repeatedExecs);
    double timeLomuto = averageRuntimeTests(Partitioning::partitionLomuto, size, warmupIterations,
            measureIterations, repeatedExecs);
    System.out.printf("%nHoare: %f us/op%n", timeHoare);
    System.out.printf("Lomuto: %f us/op%n", timeLomuto);
}
Result (whether Hoare runs before Lomuto or not):
- Lomuto: 26.676133 us/op
- Hoare: 31.8233 us/op
It is shocking to see that Lomuto is even faster than Hoare! What is the JIT compiler doing here?
I keep blaming the JIT compiler because, if I disable it completely and run in interpreter-only mode (using the -Djava.compiler=NONE flag), the benchmark again behaves as expected, even when the algorithms run a single time on each different array...
>>> case 3
Result (whether Hoare runs before Lomuto or not):
- Lomuto: 597.76 us/op
- Hoare: 254.0455 us/op
As you can see, Lomuto is again approximately 2 times slower than Hoare, as expected.
Could someone please explain what is going on with the JIT compiler in case 2? It looks like the JIT compiler just partially optimizes the code. But then, why is Lomuto as fast as Hoare? Shouldn't Hoare be still faster?
Please note:
- I know that there is the JMH library to reliably run microbenchmarks in Java. I am just trying to understand the underlying mechanics of the JIT compiler.
- micros is shorthand for microseconds and means the same as us.