Surprising Java performance

Question

I have a StressTester class like this:

public abstract class StressTest {
  public static final int WARMUP_JIT_COMPILER = 10000;

  public interface TimedAction {
    void doAction();
  }

  public static long timeAction(int numberOfTimes, TimedAction action) {
    ThreadMXBean bean = ManagementFactory.getThreadMXBean();
    for (int i = 0; i < WARMUP_JIT_COMPILER; i++) {
      action.doAction();
    }
    long currentTime = bean.getCurrentThreadCpuTime();
    for (int i = 0; i < numberOfTimes; i++) {
      action.doAction();
    }
    return (bean.getCurrentThreadCpuTime() - currentTime)/1000000;
  }
}

And a main method looks something like this:

private static boolean isPrime1(int n) { ... }
private static boolean isPrime2(int n) { ... }
private static boolean isPrime3(int n) { ... }
private static boolean isPrime4(int n) { ... }

private static final int NUMBER_OF_RUNS = 1000000;

public static void main(String[] args) {
  long primeNumberFinderTime1 = StressTest.timeAction(NUMBER_OF_RUNS, () -> {
    for (int i = 0; i < 100; i++) {
      isPrime1(i);
    }
  });
  long primeNumberFinderTime2 = StressTest.timeAction(NUMBER_OF_RUNS, () -> {
    for (int i = 0; i < 100; i++) {
      isPrime2(i);
    }
  });
  long primeNumberFinderTime3 = StressTest.timeAction(NUMBER_OF_RUNS, () -> {
    for (int i = 0; i < 100; i++) {
      isPrime3(i);
    }
  });
  long primeNumberFinderTime4 = StressTest.timeAction(NUMBER_OF_RUNS, () -> {
    for (int i = 0; i < 100; i++) {
      isPrime4(i);
    }
  });
}

When I have it set up like that then the results are pretty much as expected, and I can swap the tests and the results swap as expected. isPrime3 is about 200 times faster than isPrime1.

My real code is a bit more complex. I have several classes that find prime numbers like this:

class PrimeNumberFinder1 {
  @Override
  bool isPrime(i) { /* same code as in static isPrime1() */ };
}

class PrimeNumberFinder2 extends PrimeNumberFinder1 {
  @Override
  bool isPrime(i) { /* same code as in static isPrime2() */ };
}

class PrimeNumberFinder3 extends PrimeNumberFinder1 {
  @Override
  bool isPrime(i) { /* same code as in static isPrime3() */ };
}

class PrimeNumberFinder4 extends PrimeNumberFinder1 {
  @Override
  bool isPrime(i) { /* same code as in static isPrime4() */ };
}

And I have a class like this:

class SomeClassWithPrimeNumberFinder {
  PrimeNumberFinder1 _pnf;

  void setPrimeNumberFinder(PrimeNumberFinder1 pnf) {
    _pnf = pnf;
  }

  void stressTest() {
    StressTest.doAction(10000000, () -> {
      for (int i = 0; i < 100; i++) {
        _pnf.isPrime(i);
      }
    });
  }
}

And my main method:

public static void main(String() args) {
  SomeClassWithPrimeNumberFinder sc = new SomeClassWithPrimeNumberFinder();
  sc.setPrimeNumberFinder(new PrimeNumberFinder1());
  sc.stressTest();
  sc.setPrimeNumberFinder(new PrimeNumberFinder2());
  sc.stressTest();
  sc.setPrimeNumberFinder(new PrimeNumberFinder3());
  sc.stressTest();
  sc.setPrimeNumberFinder(new PrimeNumberFinder4());
  sc.stressTest();
}

With this setup PrimeNumberFind1 is about as fast as isPrime1() in the first test. But PrimeNumberFind3 is about 200 times slower than isPrime3() in the first test.

If I move PrimeNumberFind3 so it runs first, I get the same times as isPrime3() in the first test. The rest of the times are a bit slower too (5-10%), but nothing like PrimeNumberFind3.

The first 3 PrimeNumberFind's are just loops and ifs. No state involved. The last one has a constructor that creates a lookup list, but is just a simple loop as well. If I take the code out of the constructor and create the lookup list with an array literal, the timing is identical.

Any ideas why this is happening?

Have you tried accumulating the result into a `volatile` field instead of throwing it away? — biziclop, Sep 22 '15 at 20:01
I tried accumulating the result into a `volatile` field, but it made no difference. I'm new to Java, but volatile has something to do with threads caching their own version of a value, right? This program is purely single-threaded. — carlsb3rg, Sep 22 '15 at 20:23
@carlsb3rg You're right, `volatile` is related to multi-threading, preventing re-ordering of reads/writes of a field (roughly speaking). But as a side-effect of the above, it also stops reads/writes being optimized away completely. And this is why they are used to accumulate results even in single-threaded micro-benchmarks. — biziclop, Sep 22 '15 at 20:29
@biziclop Accumulating was the key, but I didn't have to accumulate to a volatile field. @peter-lawrey got it, but thanks anyway. I'm sure `volatile` will come in useful later ;) — carlsb3rg, Sep 22 '15 at 20:48
What's with the down-votes? Every bit of relevant code was included. — carlsb3rg, Sep 22 '15 at 21:04

score 4 · Accepted Answer · answered Sep 22 '15 at 20:03

4

What is likely to be happening is that initially isPrime is discard as dead code as the result is not kept/used so it will be running impossibly fast i.e. faster than a clock cycle for example.

However, when you provide multiple implementations it has to choose which method to call, so it takes a bit longer with a second method but the JIT can't inline more than two methods so the third implmentation means the test is dramatically slower as it has to do more work to call the discarded method.

answered Sep 22 '15 at 20:03

Peter Lawrey

525,659
79
751
1,130

1

@JohnBollinger while the best thing to do is to fix the test, I though it worth explaining the result he got. ;) – Peter Lawrey Sep 22 '15 at 20:19
Yes. The static versions (specifically isPrime3()) was running impossibly fast because all it contained was `return (n == 2 || n == 3...)` for all prime numbers under 100. And JIT inlining two methods makes perfect sense when I look at the numbers. – carlsb3rg Sep 22 '15 at 20:51
Now I just have to figure out why a method that returns false for n < 2, true for n < 4, false for n % 2 == 0 and a loop that tries every odd number from 3 -> sqrt(n) is 10% faster than a return statement that just or's n with all prime numbers < 100... – carlsb3rg Sep 22 '15 at 21:08
1

@carlsb3rg Inlining could be the key here too. That long chain of `or`s may kick your method beyond the instruction limit for inlining, while your short loop is probably handled much better. I recommend you take a look at [jitwatch](https://github.com/AdoptOpenJDK/jitwatch/), it might come handy. – biziclop Sep 22 '15 at 21:16
Thank you very much! I'll check out jitwatch and see if I can do some bit manipulation instead. – carlsb3rg Sep 22 '15 at 21:33

Surprising Java performance

1 Answers1