
I was testing two different approaches (primes() and primesOpt()) to collect the prime numbers up to a given maximum using the Java 8 IntStream. I took these examples from chapter 6 of Java 8 in Action. You can get the source code from this gist (Primes.java) and this pom.xml to build it with Maven and JMH integration: copy pom.xml to the project folder and Primes.java to src\main\java\primes, build with mvn clean install, and then run the benchmark with java -jar target\benchmarks.jar.

The first example (the primes() method) is a straightforward algorithm that collects the prime numbers into a List<Integer>. The second one (the primesOpt() method) is an enhanced approach that only tests divisibility by the previously found primes.

I tested both implementations with JMH, computing the List<Integer> of prime numbers up to a maximum of 10,000:

@Benchmark
public int testPrimes() {
    return primes(10_000).size();
}

@Benchmark    
public int testPrimesOpt() {
    return primesOpt(10_000).size();
}

I got different speedups depending on the JVM architecture. On the 64-bit JVM I observe a speedup of about 25% for primesOpt() over the standard version primes(), whereas on the 32-bit JVM there is no speedup at all.

Results for JRE 1.8.0_91-b14 64-Bit:

Benchmark              Mode  Cnt    Score    Error  Units
Primes.testPrimes     thrpt   50  269,278 ± 15,922  ops/s
Primes.testPrimesOpt  thrpt   50  341,861 ± 25,413  ops/s

Results for JRE 1.8.0_91-b14 32-Bit:

Benchmark              Mode  Cnt    Score   Error  Units
Primes.testPrimes     thrpt  200  105,388 ± 2,741  ops/s
Primes.testPrimesOpt  thrpt  200  103,015 ± 2,035  ops/s

These tests were performed on a machine with a dual-core Intel i7 CPU with hyper-threading (2 cores, 4 hardware threads) and 4 GB of RAM, running Windows 7. The JVM version used was 1.8.0_91-b14. The benchmark was executed with a minimum heap size of 1024 MB (-Xms1024M). No other activity was running during the measurements.

Do you have any idea why I can’t observe the same performance improvement on the 32-bit JVM for the optimized version of the primes algorithm?

primes() method implementation:

public static boolean isPrime(int n) {
    int root = (int) Math.sqrt(n);
    return IntStream
        .rangeClosed(2, root)
        .noneMatch(div -> n%div == 0);
}
public static List<Integer> primes(int max) {
    return IntStream
        .range(2, max)
        .filter(Primes::isPrime)
        .collect(ArrayList::new, ArrayList::add, ArrayList::addAll);
}

primesOpt() method implementation:

public static boolean isPrimeOpt(List<Integer> primes, int n) {
    int root = (int) Math.sqrt(n);
    return takeWhile(primes, root)
        .stream()
        .noneMatch(div -> n%div == 0);
}

public static List<Integer> takeWhile(List<Integer> src, int max) {
    int i;
    for(i = 0; i < src.size() && src.get(i) <= max; i++) {}
    return src.subList(0, i);
}

public static List<Integer> primesOpt(int max) {
    ArrayList<Integer> res = new ArrayList<>();
    return IntStream
        .range(2, max)
        .filter(n -> Primes.isPrimeOpt(res, n))
        .collect(() -> res, ArrayList::add, (l1, l2) -> {});
}
Miguel Gamboa
  • I suggest you use JMH to validate your micro benchmarks. If you get strange results it is hard to say whether there is a real difference or a fault in the test. – Peter Lawrey May 16 '16 at 19:45
  • @PeterLawrey I updated the OP with JMH results. I did not observe a two-fold speedup, but I still observe inconsistent results between the 64-bit and 32-bit versions of the JVM. The `primesOpt()` performs 20% better on the 64-bit JVM, but it is almost 10% worse on the 32-bit JVM. And this result is still the reason for my question: why can’t I observe the same performance improvement on both the 32-bit and 64-bit JVMs? – Miguel Gamboa May 16 '16 at 21:57
  • I think you should start by *profiling* both versions of the benchmark on both platforms. Also, turn on GC logging. – Stephen C May 16 '16 at 22:45
  • What are your memory settings? What does the GC log look like? Was your PC running at 100%, swapping to disk, etc.? Performance-wise, you need to look at the full environment settings to ensure there's no bottleneck. – Minh Kieu May 17 '16 at 08:44
  • @MinhKieu JVM configured with -Xms1G in both cases. System physical memory occupation of ~50%. CPU at 25%, corresponding to only one Java process running the JMH benchmark at 25% CPU usage (on an Intel i7 dual core with hyperthreading, corresponding to 4 virtual processors). Nothing else was running on the machine during the benchmark execution. – Miguel Gamboa May 17 '16 at 08:52
  • Try setting -Xms and -Xmx to 3G. It would be good to see if GC is running at all. – Minh Kieu May 17 '16 at 09:16
  • @MinhKieu "On most modern 32-bit Windows systems the maximum heap size will range from 1.4G to 1.6G", taken from [this answer](http://stackoverflow.com/a/1434901/1140754). But I will update my OP with the results for -Xms set to 1024M. – Miguel Gamboa May 17 '16 at 09:51
  • On a 32-bit JVM, the max heap can be set to 3G. You should set both Xms and Xmx to the same value so that the JVM doesn't have to keep growing the heap when needed. I am not sure if it will improve your results, but try it anyway. On the 32 vs 64 bit debate, I would say 32 bit should run faster than 64 bit for a small-scale application; the advantage of 64 bit is that you can assign more than 4G of heap space. – Minh Kieu May 17 '16 at 09:54
  • @MinhKieu As I mentioned, on Windows you cannot assign 3G of heap to the 32-bit version of the JVM. – Miguel Gamboa May 17 '16 at 10:01
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/112143/discussion-between-minh-kieu-and-miguel-gamboa). – Minh Kieu May 17 '16 at 10:22
  • That `collect(() -> res, ArrayList::add, (l1, l2) -> {})` is a completely broken abuse of the API. Since you are actually doing the equivalent of `forEachOrdered(res::add)`, just use that. – Holger May 17 '16 at 12:02

1 Answer


I can’t reproduce your results, but generally, performance can differ significantly depending on environmental factors. In your code, the takeWhile approach forces processing of boxed Integer values, whereas the non-optimized variant isPrime processes int values only.

That trade-off should pay off the more primes you request, i.e. if scanning the first 10_000 numbers shows inconclusive results, try 100_000 or 1_000_000. The boxing overhead is linear at worst, and a decent JVM might turn it into a sub-linear or even constant overhead, whereas the benefit of limiting the divisions to the actual primes should grow more than linearly, as the density of primes decreases for higher numbers.
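
For illustration, one way to check this is to make the limit a JMH parameter; the following is only a sketch (the state class name Limit, the field max and the chosen values are made up here, not part of the original benchmark):

// assumes the usual JMH annotations from org.openjdk.jmh.annotations
@State(Scope.Benchmark)
public static class Limit {
    @Param({"10000", "100000", "1000000"})
    int max;
}

@Benchmark
public int testPrimes(Limit limit) {
    return primes(limit.max).size();
}

@Benchmark
public int testPrimesOpt(Limit limit) {
    return primesOpt(limit.max).size();
}

If boxing is the limiting factor, the gap between the two variants should widen with the larger limits on both JVMs.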

So perhaps the 32-bit JVM you have used has a higher overhead when processing boxed values, but I assume that testing with a higher max will also reveal an advantage for your optimized variant, unless that JVM knows a trick to reduce the cost of division operations significantly.


But it shouldn’t be ignored that your optimized variant is broken in several ways. You are passing the supplier () -> res to collect, which violates the contract because it doesn’t return a new container on each evaluation, and there is interference between the Collector and the Predicate used in the preceding filter step.
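
A variant that keeps the incremental idea without abusing collect could simply use forEachOrdered(res::add) as the terminal operation, as suggested in the comments. This is only a sketch (the name primesOptFixed is made up, and it reuses the isPrimeOpt method from the question):

public static List<Integer> primesOptFixed(int max) {
    List<Integer> res = new ArrayList<>();
    IntStream.range(2, max)
        .filter(n -> isPrimeOpt(res, n))
        .forEachOrdered(res::add); // sequential and ordered: each prime is added before the next candidate is tested
    return res;
}

But even that remains a stateful workaround rather than a clean Stream pipeline.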

This suggests that trying to optimize your Stream-based solution perhaps doesn’t lead anywhere. Compare it with the following straightforward approach:

public static List<Integer> primesBest(int max) {
    // bit i of the BitSet represents the odd number 2*i + 1; bit 0 (the number 1) stays clear
    BitSet prime = new BitSet();
    prime.set(1, max >> 1);                      // mark the odd numbers from 3 up to (but below) max as candidates
    for(int i = 3; i < max; i += 2)
        if(prime.get((i - 1) >> 1))              // i is still marked, hence prime
            for(int b = i * 3; b < max; b += i * 2)
                prime.clear((b - 1) >> 1);       // clear the odd multiples of i
    return IntStream.concat(IntStream.of(2),     // prepend the only even prime
        prime.stream().map(i -> i + i + 1))      // map bit indices back to odd numbers
        .boxed().collect(Collectors.toList());
}

It avoids all division and boxing operations, at the “disadvantage” of not using Stream operations for the value selection, but only for the creation of the final List<Integer>. On my machine, it is about 10 times faster than your optimized variant for 10_000 elements and 50 times faster for 1_000_000 elements. That suggests that performance differences on the order of 10%, 20%, or even a factor of two or three are not worth much discussion.
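
If you want to verify those factors yourself, a benchmark method mirroring the setup from the question would do (the method name here is arbitrary):

@Benchmark
public int testPrimesBest() {
    return primesBest(10_000).size();
}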

I don’t see how this algorithm can be expressed using the Stream API though. The bottom line might be that not every operation benefits from the Stream API.

Holger