
I'm trying to understand what sort of compile-time optimizations I can hope for in Java when code is executed many times. I'm particularly interested in arithmetic simplifications in the following scenario:

Imagine that you need to transform a million 3d points using the same 3d affine transformation. If the transformation turns out to be a pure translation, a good optimizer would be able to turn 12 multiplications and 12 additions into 3 additions only, because all multiplications are multiplications by 1 and many additions are additions with zeros.
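To make that concrete, here is a sketch (the names and the row-major 3x4 matrix layout are just my illustration, not code from an actual library) of the full transform next to the translation-only version I'd hope a JIT could derive:

```java
// Hypothetical sketch: the general 3d affine transform costs 9 multiplications
// and 9 additions per point (12 operations per row counting the translation
// add), but when the 3x3 part is the identity it collapses to 3 additions.
public class AffineSketch {
    // Row-major 3x4 affine matrix: [m00 m01 m02 tx | m10 m11 m12 ty | m20 m21 m22 tz]
    static void applyFull(double[] m, double[] p, double[] out) {
        out[0] = m[0] * p[0] + m[1] * p[1] + m[2]  * p[2] + m[3];
        out[1] = m[4] * p[0] + m[5] * p[1] + m[6]  * p[2] + m[7];
        out[2] = m[8] * p[0] + m[9] * p[1] + m[10] * p[2] + m[11];
    }

    // The version I hope an optimizer could derive when m is a pure translation.
    static void applyTranslation(double[] m, double[] p, double[] out) {
        out[0] = p[0] + m[3];
        out[1] = p[1] + m[7];
        out[2] = p[2] + m[11];
    }

    public static void main(String[] args) {
        // Identity rotation part + translation (5, -2, 0.5)
        double[] translation = {1, 0, 0, 5,   0, 1, 0, -2,   0, 0, 1, 0.5};
        double[] p = {10, 20, 30};
        double[] a = new double[3], b = new double[3];
        applyFull(translation, p, a);
        applyTranslation(translation, p, b);
        System.out.println(java.util.Arrays.equals(a, b)); // prints: true
    }
}
```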

Before trying this 'complex scenario', I just ran some multiplications and additions in loops, and despite the cool stuff I've been reading about Java's JIT compiler(s), I've been a bit disappointed.

First - what works: performing many multiplications by 1 seems to be simplified and is executed very fast:

        tic();
        final int nRepetitions = 100_000_000;
        final double factor1 = 1.0d;
        value = 0.0d;
        for (int i=0;i<nRepetitions;i++) {
            for (int j=0;j<20;j++) {
                value = value * factor1;
            }
        }
        System.out.println("Result = "+value);
        toc();
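(`tic()` and `toc()` are simple `System.nanoTime()`-based timing helpers, roughly equivalent to this sketch:)

```java
// Rough equivalent of the tic()/toc() helpers used in the snippets above.
public class Timing {
    private static long start;

    // Start the stopwatch.
    public static void tic() {
        start = System.nanoTime();
    }

    // Print and return the elapsed time in milliseconds.
    public static double toc() {
        double ms = (System.nanoTime() - start) / 1e6;
        System.out.printf("Elapsed time %10.4f  ms%n", ms);
        return ms;
    }
}
```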

I get this execution speed, which is fast:

 Result with graalvm-ce-17\bin\java.exe
 ----------------------
 Repeating 100000000 multiplication by factor1 = 1.0, a final variable
 The code is put in the main method
 This is overall is No op, and should be super fast
 Result = 0.0
 Elapsed time   64.8528  ms
 ----------------------

If I perform the same computation, but by calling a function, I do not get any optimization at all, and the execution time is about 2 seconds.

        tic();
        value = 0.0d;
        repeatMultiply(nRepetitions, value, factor1);
        toc();

The function being:

    public static void repeatMultiply(int nRepetitions, double value, final double multFactor) {
        for (int i=0;i<nRepetitions;i++) {
            for (int j=0;j<20;j++) {
                value = value * multFactor;
            }
        }
        System.out.println("Result = "+value);
    }

Now the code runs very slowly:

 ----------------------
  Repeating 100000000 multiplication by factor1 = 1.0d
  Function called = repeatMultiply
  This is overall is No op, and should be super fast
  Result = 0.0
  Elapsed time  1815.1354    ms
 ----------------------

I've tested other things. Not declaring the variable factor1 as final in the first example destroys the only optimization I saw. I then tried adding zeros instead of multiplying by ones, but that was even worse: I always got the 'long' execution time, which is about two seconds on my machine.

I've tested Oracle JDK v1.8 and v18, as well as GraalVM Community Edition 17, but none of them made a noticeable difference. I've put a gist with all the code and tests I performed at https://gist.github.com/NicoKiaru/2949e6969087e75b07b21596d80c7882

I hope you can enlighten me and let me know whether these results reflect an intrinsic limitation of Java's JIT compilers or whether I've done something wrong in my testing.

Is there any way my goal (auto-optimisation of affine transformation computations) can be reached with JIT compilation, or should I stop dreaming and explicitly handle the "simple scenario" (like translation) in the Java code?
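For reference, the explicit handling I'd otherwise write by hand is a one-time check before the hot loop, roughly like this sketch (hypothetical names, same row-major 3x4 matrix layout as above):

```java
import java.util.Arrays;

// Hypothetical sketch of explicit special-casing: check once, outside the
// hot loop, whether the 3x3 part of the row-major 3x4 matrix m is the
// identity, and pick the cheap per-point kernel accordingly.
public class TransformDispatch {
    static boolean isPureTranslation(double[] m) {
        return m[0] == 1 && m[1] == 0 && m[2]  == 0
            && m[4] == 0 && m[5] == 1 && m[6]  == 0
            && m[8] == 0 && m[9] == 0 && m[10] == 1;
    }

    // points is a flat array of x,y,z triples, transformed in place.
    static void transformAll(double[] m, double[] points) {
        if (isPureTranslation(m)) {
            for (int i = 0; i < points.length; i += 3) { // 3 additions per point
                points[i]     += m[3];
                points[i + 1] += m[7];
                points[i + 2] += m[11];
            }
        } else {
            for (int i = 0; i < points.length; i += 3) { // full 9 mul + 9 add
                double x = points[i], y = points[i + 1], z = points[i + 2];
                points[i]     = m[0] * x + m[1] * y + m[2]  * z + m[3];
                points[i + 1] = m[4] * x + m[5] * y + m[6]  * z + m[7];
                points[i + 2] = m[8] * x + m[9] * y + m[10] * z + m[11];
            }
        }
    }

    public static void main(String[] args) {
        double[] translation = {1, 0, 0, 1,  0, 1, 0, 2,  0, 0, 1, 3};
        double[] pts = {0, 0, 0,  1, 1, 1};
        transformAll(translation, pts);
        System.out.println(Arrays.toString(pts)); // [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]
    }
}
```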

    Graal is not JIT. Further, benchmarking JIT runtimes like HotSpot is _extremely hard_, and I'd guess with something like 98% confidence that you are doing it incorrectly. Start by reading about and using JMH. – Louis Wasserman Jul 25 '22 at 20:56
  • 1
    https://developers.redhat.com/articles/2021/06/23/how-jit-compiler-boosts-java-performance-openjdk#deoptimization_and_speculation – OldProgrammer Jul 25 '22 at 20:56
  • 5
    @Robert 1.0 *does* exist as a double. It is exactly expressible in binary. Unlike, say, 0.3. – khelwood Jul 25 '22 at 21:07
  • 1
    @Robert What? Of course the double value 1.0 exists. The mantissa is 1.0, to as many significant figures as you like, and the exponent is 0. – Dawood ibn Kareem Jul 25 '22 at 21:19
  • 3
    My guess: It has to do with `value` being a _parameter_ in the method version of your code. This makes it much harder to guarantee that `value` will **always** be `0.0`. – Slaw Jul 25 '22 at 21:39
  • Thanks for all the comments! It's nice to see such fast feedback! @louis-wasserman I've seen JMH mentioned somewhere as the proper way to benchmark JIT; it looks a bit complicated but I'll give it a try. I also tried to read (and watch) as many things as possible regarding JIT runtime optimization, but I've never clearly found whether it actively tries to remove 'useless' arithmetic operations (which would bring a big perf advantage to JIT vs AOT compilation). Is it possible in theory? I hope JMH can reveal why the magic does not seem to happen. – Nicolas Chiaruttini Jul 25 '22 at 22:22
  • 1
    What if you call the method many times, instead of just the once? Is it eventually optimized? – Slaw Jul 25 '22 at 23:00
  • If I were writing this, I'd be tempted to check at the beginning of the "apply transformation" method whether the transformation is an affine one - and if so, run code that doesn't have the otiose additions and multiplications. I don't feel it's the JIT's job to know how likely an expression is to come out to 0 or 1. – Dawood ibn Kareem Jul 25 '22 at 23:08
  • 8
    Your manually written benchmark collected nearly all pitfalls: no warmup; multiple measurements in one method; measuring an OSR stub instead of a regular compiled method; including printout in the measurement, etc. No wonder the benchmark results are misleading. Please check https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – apangin Jul 25 '22 at 23:36
  • 1
    *"a good optimizer would be able to turn 12 multiplications and 12 additions into 3 additions only, because all multiplications are multiplications by 1 and many additions are additions with zeros"* - only if multiplier is known to be a constant at compile time. This is rarely a case in a real application (otherwise, why would you at all leave multiplication by 1 in the source code?) – apangin Jul 25 '22 at 23:41
  • Ah, ok! Then thanks for the link @apangin. I'll dig, rewrite, and hope that'll make a difference. But do you know if such a jit optimisation is possible in theory? – Nicolas Chiaruttini Jul 25 '22 at 23:43
    @apangin I thought JIT compilation allowed runtime optimisations that were impossible with AOT compilation. If the user enters a translation matrix as an affine transform, I kind of hoped JIT compiled code optimisation could kick in if the same op is repeated many times. Now if it's not possible, I'm happy to learn that as well. – Nicolas Chiaruttini Jul 25 '22 at 23:49
  • 4
    HotSpot JIT compiler does not profile operands of arithmetic expressions. That would be too expensive. What it *does* profile is conditional branches and receivers of virtual methods. – apangin Jul 25 '22 at 23:57
  • So afaiu, a computational graph is built during jit compilation, right? If the data flow shows that a static variable is 'plugged' to a mult op, it however can't know that the variable is a one (it's not profiled), and thus there's never a reason to remove the op. Is this somewhat the idea? – Nicolas Chiaruttini Jul 26 '22 at 00:22
  • 1
    You can't rely on your benchmarks to make conclusions about JIT - try JMH. For example remove System.out from tic-toc sections and add a for loop to repeat main a few times and you should see dramatically different results (all cases under 1ms on my PC). – DuncG Jul 26 '22 at 09:00
  • @DuncG if I remove System.out, then I guess the compiler can detect that all the code does nothing and thus skips it completely (which is good, but it's not what I want to test). Shouldn't I produce and use an output somewhere? Anyway the upvoted comments are pretty clear: the benchmark is bad, so I need to redo it. It'll take me some time to get familiar with proper benchmarking however. – Nicolas Chiaruttini Jul 26 '22 at 10:27
  • 1
    @NicolasChiaruttini Right, if the JVM can deduce at JIT compilation time that the operand is constant, it'll make use of this information. If it can't deduce statically, JIT won't look at the actual value. – apangin Jul 26 '22 at 10:29
  • 1
    @NicolasChiaruttini Not what I'm saying. System.out can be slow and will skew timing if included in your metrics. Just move your System.out lines so they are not inside the timed tic/toc section. Make `repeatMultiply` return value so the result can be printed after `toc()` and not as part of the measurement. – DuncG Jul 26 '22 at 11:07
  • 1
    I know this isn't what you asked, but 3D point transformations are very simple and fast even if you do the full matrix. It's probably not going to be the bottleneck. If it does cause a slowdown, and you know that most of the transformations are simple translations, this is something you can fix yourself explicitly, in any programming language: just check before looping over an array of points, what type of matrix do we have? If it's just translation only, do the reduced version, otherwise, do the full version. – Boann Jul 26 '22 at 13:50
  • @Boann, you're right, but my dream was to go one step further: imagine you need to extract an affine transformed 2d array from a source 2d array, and it's a case so simple that it turns out to be equivalent to target[i] = source[i+10], maybe the compiler could realize that and then vectorise these operations. JIT compilation would really shine. I've read that auto-vectorisation is very difficult, so that probably won't happen soon. – Nicolas Chiaruttini Jul 26 '22 at 14:59
  • 1
    @NicolasChiaruttini I'm not aware of the JIT ever compiling code into multiple choice versions based on particular values of numbers. Unless *all* calls in the entire program are the "simple case", that's the kind of optimization that you'd have to do yourself. – Boann Jul 26 '22 at 16:11
