Java 8u40 Math.round() very slow

Question

I have a fairly simple hobby project written in Java 8 that makes extensive use of repeated Math.round() calls in one of its modes of operation. For example, one such mode spawns 4 threads and queues 48 runnable tasks by way of an ExecutorService, each of which runs something similar to the following block of code 2^31 times:

int3 = Math.round(float1 + float2);
int3 = Math.round(float1 * float2);
int3 = Math.round(float1 / float2);

That's not exactly how it is (there are arrays involved, and nested loops), but you get the idea. Anyway, prior to Java 8u40, the code that resembles the above could complete the full run of ~103 billion instruction blocks in about 13 seconds on an AMD A10-7700k. With Java 8u40 it takes around 260 seconds to do the same thing. No changes to code, no nothing, just a Java update.

Has anyone else noticed Math.round() getting much slower, especially when it is used repetitiously? It's almost as though the JVM was doing some sort of optimization before that it isn't doing anymore. Maybe it was using SIMD prior to 8u40 and it isn't now?

edit: I have completed my second attempt at an MVCE. You can download the first attempt here:

https://www.dropbox.com/s/rm2ftcv8y6ye1bi/MathRoundMVCE.zip?dl=0

The second attempt is below. My first attempt has been otherwise removed from this post as it has been deemed to be too long, and is prone to dead code removal optimizations by the JVM (which apparently are happening less in 8u40).

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MathRoundMVCE
{           
    static long grandtotal = 0;
    static long sumtotal = 0;

    static float[] float4 = new float[128];
    static float[] float5 = new float[128];
    static int[] int6 = new int[128];
    static int[] int7 = new int[128];
    static int[] int8 = new int[128];
    static long[] longarray = new long[480];

    final static int mil = 1000000;

    public static void main(String[] args)
    {       
        initmainarrays();
        OmniCode omni = new OmniCode();
        grandtotal = omni.runloops() / mil;
        System.out.println("Total sum of operations is " + sumtotal);
        System.out.println("Total execution time is " + grandtotal + " milliseconds");
    }   

    public static long siftarray(long[] larray)
    {
        long topnum = 0;
        long tempnum = 0;
        for (short i = 0; i < larray.length; i++)
        {
            tempnum = larray[i];
            if (tempnum > 0)
            {
                topnum += tempnum;
            }
        }
        topnum = topnum / Runtime.getRuntime().availableProcessors();
        return topnum;
    }

    public static void initmainarrays()
    {
        int k = 0;

        do
        {           
            float4[k] = (float)(Math.random() * 12) + 1f;
            float5[k] = (float)(Math.random() * 12) + 1f;
            int6[k] = 0;

            k++;
        }
        while (k < 128);        
    }       
}

class OmniCode extends Thread
{           
    volatile long totaltime = 0;
    final int standard = 16777216;
    final int warmup = 200000;

    byte threads = 0;

    public long runloops()
    {
        this.setPriority(MIN_PRIORITY);

        threads = (byte)Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(threads);

        for (short j = 0; j < 48; j++)
        {           
            executor.execute(new RoundFloatToIntAlternate(warmup, (byte)j));
        }

        executor.shutdown();

        while (!executor.isTerminated())
        {
            try
            {
                Thread.sleep(100);
            } 
            catch (InterruptedException e)
            {
                //Do nothing                
            }
        }

        executor = Executors.newFixedThreadPool(threads);

        for (short j = 0; j < 48; j++)
        {           
            executor.execute(new RoundFloatToIntAlternate(standard, (byte)j));          
        }

        executor.shutdown();

        while (!executor.isTerminated())
        {
            try
            {
                Thread.sleep(100);
            } 
            catch (InterruptedException e)
            {
                //Do nothing                
            }
        }

        totaltime = MathRoundMVCE.siftarray(MathRoundMVCE.longarray);   

        executor = null;
        Runtime.getRuntime().gc();
        return totaltime;
    }
}

class RoundFloatToIntAlternate extends Thread
{       
    int i = 0;
    int j = 0;
    int int3 = 0;
    int iterations = 0;
    byte thread = 0;

    public RoundFloatToIntAlternate(int cycles, byte threadnumber)
    {
        iterations = cycles;
        thread = threadnumber;
    }

    public void run()
    {
        this.setPriority(9);
        MathRoundMVCE.longarray[this.thread] = 0;
        mainloop();
        blankloop();    

    }

    public void blankloop()
    {
        j = 0;
        long timer = 0;
        long totaltimer = 0;

        do
        {   
            timer = System.nanoTime();
            i = 0;

            do
            {
                i++;
            }
            while (i < 128);
            totaltimer += System.nanoTime() - timer;            

            j++;
        }
        while (j < iterations);         

        MathRoundMVCE.longarray[this.thread] -= totaltimer;
    }

    public void mainloop()
    {
        j = 0;
        long timer = 0; 
        long totaltimer = 0;
        long localsum = 0;

        int[] int6 = new int[128];
        int[] int7 = new int[128];
        int[] int8 = new int[128];

        do
        {   
            timer = System.nanoTime();
            i = 0;

            do
            {
                int6[i] = Math.round(MathRoundMVCE.float4[i] + MathRoundMVCE.float5[i]);
                int7[i] = Math.round(MathRoundMVCE.float4[i] * MathRoundMVCE.float5[i]);
                int8[i] = Math.round(MathRoundMVCE.float4[i] / MathRoundMVCE.float5[i]);

                i++;
            }
            while (i < 128);
            totaltimer += System.nanoTime() - timer;

            for(short z = 0; z < 128; z++)
            {
                localsum += int6[z] + int7[z] + int8[z];
            }       

            j++;
        }
        while (j < iterations);         

        MathRoundMVCE.longarray[this.thread] += totaltimer;
        MathRoundMVCE.sumtotal = localsum;
    }
}

Long story short, this code performed about the same in 8u25 as in 8u40. As you can see, I am now recording the results of all calculations into arrays, and then summing those arrays outside of the timed portion of the loop to a local variable which is then written to a static variable at the end of the outer loop.

Under 8u25: Total execution time is 261545 milliseconds

Under 8u40: Total execution time is 266890 milliseconds

Test conditions were the same as before. So, it would appear that 8u25 and 8u31 were doing dead code removal that 8u40 stopped doing, causing the code to "slow down" in 8u40. That doesn't explain every weird little thing that's cropped up but that appears to be the bulk of it. As an added bonus, the suggestions and answers provided here have given me inspiration to improve the other parts of my hobby project, for which I am quite grateful. Thank you all for that!

I've run those methods 10K, 100K, 1M and 10M times with java7 and java8 and get pretty similar results for both. Definitely need that MCVE — beresfordt, Mar 14 '15 at 19:56
Okay, going to work on rewriting the program to focus just on the Math.round portion. I'll get to it as soon as possible. And I'll roll back to update 30 on my Windows partition if I am able . Or at least, I can try . . . in any case, the code in question is here: https://www.dropbox.com/s/53zuk227qr4wdpn/mathtestersource03112015.zip?dl=0. The methods in question are in classes RoundFloatToInt, RoundFloatToIntNoDiv, RoundFloatToIntAlternate, RoundFloatToIntNoDivAlternate, and OmniLoop (methods in OmniLoop is roundfloattointloop/roundfloattointloopalternate). — user3765373, Mar 15 '15 at 02:37
Math.sqrt() also is showing new behavior in 8u40 (read: worse performance) but it isn't as bad. I'll try to include that in the MVCE too. — user3765373, Mar 15 '15 at 02:48
Is the problem reproducible with 8u31? The more you can bisect the range within which the bug was introduced the easier it is for us. Browsing of the Mercurial repos of the OpenJDK can be done here: http://hg.openjdk.java.net/jdk8u , probably under the `hotspot` subrepos. — Iwillnotexist Idonotexist, Mar 15 '15 at 14:04
Can **anybody** reproduce this issue on Windows (7/64, for example)? The current "MCVE" is neither M nor C, and thus rather useless. The code in the current answer does not properly trigger and take into account the JIT, and is thus rather useless as well. In my tests, so far (and based on my previous experience with the JIT), the u40 has been **faster** than the u31. The machine code generated by the JIT looks surprisingly different for both versions, but since the new one seems to be faster, I have not yet analyzed where the difference comes from. — Marco13, Mar 15 '15 at 21:54
"If precision is not critical (which it doesn't seem to be since you are using floats), Math.round([float]) can be approximated to (int) ([float] + 0.5F), and Math.round([double]) to (long) ([double] + 0.5D)." — Clashsoft, Mar 16 '15 at 07:57
Iwillnotexist, the problem did not crop up when I updated my Linux system to 8u31, so I'm going to say that it did not exist with that update. Marco, sorry if my attempt at an MVCE was inadequate. I'll try to do better next time, or maybe I'll even take a whack at cutting down the existing attempt at a second one. — user3765373, Mar 16 '15 at 17:24

Ivan Mamontov · Answer 1 · 2015-05-13T11:02:47.183

Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Modern JVMs are too complex, and do all kinds of optimization. If you try to measure some small piece of code, it is really complicated to do it correctly without very, very detailed knowledge of what the JVM is doing. The culprit of many benchmarks is the dead-code elimination: compilers are smart enough to deduce some computations are redundant, and eliminate them completely. Please read the following slides http://shipilev.net/talks/jvmls-July2014-benchmarking.pdf. In order to "fix" Adam's microbenchmark(I still can't understand what it is measuring and this "fix" does not take into account warm up, OSR and many others microbenchmarking pitfalls) we have to print the result of the computation to system output:

    int result = 0;
    long t0 = System.currentTimeMillis();
    for (int i = 0; i < 1e9; i++) {
        result += Math.round((float) i / (float) (i + 1));
    }
    long t1 = System.currentTimeMillis();
    System.out.println("result = " + result);
    System.out.println(String.format("%s, Math.round(float), %.1f ms", System.getProperty("java.version"), (t1 - t0)/1f));

As a result:

result = 999999999
1.8.0_25, Math.round(float), 5251.0 ms

result = 999999999
1.8.0_40, Math.round(float), 3903.0 ms

The same "fix" for original MVCE example

It took 401772 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_40

It took 410767 milliseconds to complete edu.jvm.runtime.RoundFloatToInt. <==== 1.8.0_25

If you want to measure the real cost of Math#round you should write something like this(based on jmh)

package org.openjdk.jmh.samples;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 5, timeUnit = TimeUnit.SECONDS)
public class RoundBench {

    float[] floats;
    int i;

    @Setup
    public void initI() {
        Random random = new Random(0xDEAD_BEEF);
        floats = new float[8096];
        for (int i = 0; i < floats.length; i++) {
            floats[i] = random.nextFloat();
        }
    }

    @Benchmark
    public float baseline() {
        i++;
        i = i & 0xFFFFFF00;
        return floats[i];
    }

    @Benchmark
    public int round() {
        i++;
        i = i & 0xFFFFFF00;
        return Math.round(floats[i]);
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(RoundBench.class.getName())
                .build();
        new Runner(options).run();
    }
}

My results are:

1.8.0_25
Benchmark            Mode  Cnt  Score   Error  Units
RoundBench.baseline  avgt    6  2.565 ± 0.028  ns/op
RoundBench.round     avgt    6  4.459 ± 0.065  ns/op

1.8.0_40 
Benchmark            Mode  Cnt  Score   Error  Units
RoundBench.baseline  avgt    6  2.589 ± 0.045  ns/op
RoundBench.round     avgt    6  4.588 ± 0.182  ns/op

In order to find the root cause of the problem you can use https://github.com/AdoptOpenJDK/jitwatch/. To save time, I can say that the size of JITted code for Math#round was increased in 8.0_40. It is almost unnoticeable for small methods, but in case of huge methods too long sheet of machine code pollutes instruction cache.

Not really an "answer" to the question, but at least pointing out what's wrong with the tests so far. The dead code elimination (and the fact that, for a single loop in the main method, most optimizations can't be done at all...) are important, and your measurements are roughly in line with my observations about the timing until now. I'd still insist in an MCVE (in contrast to the LNCNVE that is available until now)... — Marco13, Mar 15 '15 at 23:56
I have had many problems with dead code elimination/NOOPs in the past with this hobby project. Usually the execution time is reduced to zero or near-zero since the JVM takes a scorched-earth approach to the entire containing method. The only way I've been able to avoid this without using volatile has been to use loops that are either not unrolled at all or are only partially so. One loop of 128 instructions unrolled completely will cause the JVM to toss out everything. I have not as-of-yet seen the JVM selectively eliminate portions of the method or loop, though maybe I'm seeing that now. — user3765373, Mar 16 '15 at 17:54
Another telltale sign of dead code elimination/NOOPs I've observed is related to CPU operating temperature. If the JVM is throwing out dead code, the CPU will still show 100% utilization for the duration of the program, but CPU operating temperature will hardly budge. Different types of operations will cause different operating temperatures at 100% utilization, but NOOPs barely register. Which makes sense, since the CPU isn't really doing anything except iterating an empty loop at that point (if that). I will try a rewrite storing results in an array when time permits. Something shorter. — user3765373, Mar 16 '15 at 17:57

Adam · Accepted Answer · 2015-03-16T08:42:23.663

MVCE based on OP

can likely be simplified further
changed int3 = statements to int3 += to reduce chance of dead code removal. int3 = difference from 8u31 to 8u40 is factor 3x slower. Using int3 += difference is only 15% slower.
print result to further reduce chance of dead code removal optimisations

Code

public class MathTime {
    static float[][] float1 = new float[8][16];
    static float[][] float2 = new float[8][16];

    public static void main(String[] args) {
        for (int j = 0; j < 8; j++) {
            for (int k = 0; k < 16; k++) {
                float1[j][k] = (float) (j + k);
                float2[j][k] = (float) (j + k);
            }
        }
        new Test().run();
    }

    private static class Test {
        int int3;

        public void run() {
            for (String test : new String[] { "warmup", "real" }) {

                long t0 = System.nanoTime();

                for (int count = 0; count < 1e7; count++) {
                    int i = count % 8;
                    int3 += Math.round(float1[i][0] + float2[i][0]);
                    int3 += Math.round(float1[i][1] + float2[i][1]);
                    int3 += Math.round(float1[i][2] + float2[i][2]);
                    int3 += Math.round(float1[i][3] + float2[i][3]);
                    int3 += Math.round(float1[i][4] + float2[i][4]);
                    int3 += Math.round(float1[i][5] + float2[i][5]);
                    int3 += Math.round(float1[i][6] + float2[i][6]);
                    int3 += Math.round(float1[i][7] + float2[i][7]);
                    int3 += Math.round(float1[i][8] + float2[i][8]);
                    int3 += Math.round(float1[i][9] + float2[i][9]);
                    int3 += Math.round(float1[i][10] + float2[i][10]);
                    int3 += Math.round(float1[i][11] + float2[i][11]);
                    int3 += Math.round(float1[i][12] + float2[i][12]);
                    int3 += Math.round(float1[i][13] + float2[i][13]);
                    int3 += Math.round(float1[i][14] + float2[i][14]);
                    int3 += Math.round(float1[i][15] + float2[i][15]);

                    int3 += Math.round(float1[i][0] * float2[i][0]);
                    int3 += Math.round(float1[i][1] * float2[i][1]);
                    int3 += Math.round(float1[i][2] * float2[i][2]);
                    int3 += Math.round(float1[i][3] * float2[i][3]);
                    int3 += Math.round(float1[i][4] * float2[i][4]);
                    int3 += Math.round(float1[i][5] * float2[i][5]);
                    int3 += Math.round(float1[i][6] * float2[i][6]);
                    int3 += Math.round(float1[i][7] * float2[i][7]);
                    int3 += Math.round(float1[i][8] * float2[i][8]);
                    int3 += Math.round(float1[i][9] * float2[i][9]);
                    int3 += Math.round(float1[i][10] * float2[i][10]);
                    int3 += Math.round(float1[i][11] * float2[i][11]);
                    int3 += Math.round(float1[i][12] * float2[i][12]);
                    int3 += Math.round(float1[i][13] * float2[i][13]);
                    int3 += Math.round(float1[i][14] * float2[i][14]);
                    int3 += Math.round(float1[i][15] * float2[i][15]);

                    int3 += Math.round(float1[i][0] / float2[i][0]);
                    int3 += Math.round(float1[i][1] / float2[i][1]);
                    int3 += Math.round(float1[i][2] / float2[i][2]);
                    int3 += Math.round(float1[i][3] / float2[i][3]);
                    int3 += Math.round(float1[i][4] / float2[i][4]);
                    int3 += Math.round(float1[i][5] / float2[i][5]);
                    int3 += Math.round(float1[i][6] / float2[i][6]);
                    int3 += Math.round(float1[i][7] / float2[i][7]);
                    int3 += Math.round(float1[i][8] / float2[i][8]);
                    int3 += Math.round(float1[i][9] / float2[i][9]);
                    int3 += Math.round(float1[i][10] / float2[i][10]);
                    int3 += Math.round(float1[i][11] / float2[i][11]);
                    int3 += Math.round(float1[i][12] / float2[i][12]);
                    int3 += Math.round(float1[i][13] / float2[i][13]);
                    int3 += Math.round(float1[i][14] / float2[i][14]);
                    int3 += Math.round(float1[i][15] / float2[i][15]);

                }
                long t1 = System.nanoTime();
                System.out.println(int3);
                System.out.println(String.format("%s, Math.round(float), %s, %.1f ms", System.getProperty("java.version"), test, (t1 - t0) / 1e6));
            }
        }
    }
}

Results

adam@brimstone:~$ ./jdk1.8.0_40/bin/javac MathTime.java;./jdk1.8.0_40/bin/java -cp . MathTime 
1.8.0_40, Math.round(float), warmup, 6846.4 ms
1.8.0_40, Math.round(float), real, 6058.6 ms
adam@brimstone:~$ ./jdk1.8.0_31/bin/javac MathTime.java;./jdk1.8.0_31/bin/java -cp . MathTime 
1.8.0_31, Math.round(float), warmup, 5717.9 ms
1.8.0_31, Math.round(float), real, 5282.7 ms
adam@brimstone:~$ ./jdk1.8.0_25/bin/javac MathTime.java;./jdk1.8.0_25/bin/java -cp . MathTime 
1.8.0_25, Math.round(float), warmup, 5702.4 ms
1.8.0_25, Math.round(float), real, 5262.2 ms

Observations

For trivial uses of Math.round(float) I can find no difference in performance on my platform (Linux x86_64). There is only a difference in benchmark, my previous naive and incorrect benchmarks only exposed differences in behaviour in optimisation as Ivan's answer and Marco13's comments point out.
8u40 is less aggressive in dead code elimination than previous versions, meaning more code is executed in some corner cases and hence slower.
8u40 takes slightly longer to warm up, but once "there", quicker.

Source analysis

Surprisingly Math.round(float) is a pure Java implementation rather than native, the code for both 8u31 and 8u40 is identical.

diff  jdk1.8.0_31/src/java/lang/Math.java jdk1.8.0_40/src/java/lang/Math.java
-no differences-

public static int round(float a) {
    int intBits = Float.floatToRawIntBits(a);
    int biasedExp = (intBits & FloatConsts.EXP_BIT_MASK)
            >> (FloatConsts.SIGNIFICAND_WIDTH - 1);
    int shift = (FloatConsts.SIGNIFICAND_WIDTH - 2
            + FloatConsts.EXP_BIAS) - biasedExp;
    if ((shift & -32) == 0) { // shift >= 0 && shift < 32
        // a is a finite number such that pow(2,-32) <= ulp(a) < 1
        int r = ((intBits & FloatConsts.SIGNIF_BIT_MASK)
                | (FloatConsts.SIGNIF_BIT_MASK + 1));
        if (intBits < 0) {
            r = -r;
        }
        // In the comments below each Java expression evaluates to the value
        // the corresponding mathematical expression:
        // (r) evaluates to a / ulp(a)
        // (r >> shift) evaluates to floor(a * 2)
        // ((r >> shift) + 1) evaluates to floor((a + 1/2) * 2)
        // (((r >> shift) + 1) >> 1) evaluates to floor(a + 1/2)
        return ((r >> shift) + 1) >> 1;
    } else {
        // a is either
        // - a finite number with abs(a) < exp(2,FloatConsts.SIGNIFICAND_WIDTH-32) < 1/2
        // - a finite number with ulp(a) >= 1 and hence a is a mathematical integer
        // - an infinity or NaN
        return (int) a;
    }
}

You should use `System.nanoTime()` for benchmarks, because `System.currentTimeMillis()` is not guaranteed to be monotonically increasing. This probably does not affect your interesting results here, but still. — Waldheinz, Mar 15 '15 at 17:00
@Adam Brilliant contribution to solving this problem. Плус 1. — Iwillnotexist Idonotexist, Mar 15 '15 at 21:04
@Adam Can you follow the answers to [this](http://stackoverflow.com/questions/1503479) SO question to dump the machine code for both versions? I think it will reveal the problem and where to go next. I'm unable to find a cause for this bug by examining the Hotspot VM's Mercurial repo commits and logs [here](http://hg.openjdk.java.net/jdk8u/hs-dev/hotspot/). — Iwillnotexist Idonotexist, Mar 15 '15 at 22:36
Can do, but won't be until tomorrow. Interestingly the -Xcomp option causes results to be same, I think it's profiling related... Ie the trigger to optimise is failing... I might pull full source trees for 8u31 and 8u40 and go through all the diffs — Adam, Mar 15 '15 at 22:47
So with the "adjusted" benchmark (that does not throw away the `result`, but still does not allow things like inlining or OSR...) the difference is 3.2 vs. 3.6 seconds? If you repeat the call to `Math.round` for, maybe, 10 times in the loop, you will see that the difference vanishes (and may even be in favor of u40 in the end). — Marco13, Mar 15 '15 at 23:59
Please, warm up the JVM before test, extract test code in a separate method... There are a lot of pitfalls in your microbencmark. It is not easy to write a correct microbenchmark, see [How do I write a correct micro-benchmark in Java](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java)? — Ivan Mamontov, Mar 16 '15 at 04:51
@Ivan Thanks for the important lesson in benchmarking, warmup added. — Adam, Mar 16 '15 at 07:08
@Marco13 Thanks for the important lesson in benchmarking, I had no idea. — Adam, Mar 16 '15 at 07:09
In order to find the root cause of this degradation you can use https://github.com/AdoptOpenJDK/jitwatch/. To save time, I can say that the size of JITted code for Math#round was increased in 8.0_40. It is almost unnoticeable for small methods, but in case of huge methods too long sheet of machine code pollutes instruction cache. — Ivan Mamontov, Mar 16 '15 at 11:05
@Adam, thank you for taking time to do a rewrite. I do not know if this addresses all of the possible changes I observed when moving from u25/u31 to u40 (u25 and u31 behaved the same on the code) since you have chosen one of the methods for your rewrite, even though several of them showed changes in performance. Perhaps Ivan is on to something in that it would have been better to use the example in RoundFloatToIntAlternate. I may adjust my code in the future to write to an array to avoid += (have had problems with overflows/Infinity which has been very slow in the past). — user3765373, Mar 16 '15 at 17:43
Having replicated some of the changes you recommended, and taking a cue from Ivan in reducing the overall amount of code in the method itself, I've rewritten my MVCE and produced results in line with yours. 8u25's performance is nearly identical to 8u40's with my new code thanks to writing all the results of the calculations to a variable that is printed at the end of program execution. Marco13 still seems to be sussing out more interesting bits of information, but the majority of the issue that I observed has been answered by your example. — user3765373, Mar 16 '15 at 23:04

score 1 · Answer 3 · edited May 23 '17 at 11:44

Not a definite answer, but maybe another small contribution.

Originally, I went through the whole chain as Adam in his answer (see the history for details), tracking down and comparing bytecode, implementations an running times - although, as pointed out in the comments, during my tests (on Win7/8), and with the "usual microbenchmark best practices", the performance difference was not as striking as suggested in the original question and the first versions of the first answers.

However, there was a difference, so I created another small test:

public class MathRoundPerformance {

    static final int size = 16;
    static float[] data = new float[size];

    public static void main(String[] args) {
        for (int i = 0; i < size; i++) {
            data[i] = i;
        }

        for (int n=1000000; n<=100000000; n+=5000000)
        {
            long t0 = System.nanoTime();
            int result = runTest(n);
            long t1 = System.nanoTime();
            System.out.printf(
                "%s, Math.round(float), %s, %s, %.1f ms\n",
                System.getProperty("java.version"),
                n, result, (t1 - t0) / 1e6);
        }
    }

    public static int runTest(int n) {
        int result = 0;
        for (int i = 0; i < n; i++) {
            int i0 = (i+0) % size;
            int i1 = (i+1) % size;
            result += Math.round(data[i0] + data[i1]);
            result += Math.round(data[i0] * data[i1]);
            result += Math.round(data[i0] / data[i1]);
        }
        return result;
    }
}

The timing results (omitting some details) have been

...
1.8.0_31, Math.round(float), 96000000, -351934592, 504,8 ms

....
1.8.0_40, Math.round(float), 96000000, -351934592, 544,0 ms

I ran the examples with a hotspot disassembler VM, using

java -server -XX:+UnlockDiagnosticVMOptions -XX:+TraceClassLoading
     -XX:+LogCompilation -XX:+PrintInlining -XX:+PrintAssembly
     MathRoundPerformance

The important thing is that the optimization is finished when the program ends (or at least, it seems to be finished). This means that the results of the last calls to the runTest method are printed without any additional JIT optimization going on between the calls.

I tried to figure out the differences by looking at the generated machine code. A large portion of the generated code was the same for both versions. But as Ivan pointed out, the number of instructions did increase in 8u40. I compared the source code of the Hotspot versions u20 and u40. I thought that there might be subtle differences in the intrinsics for floatToRawIntBits, but these files did not change. I considered that the checks for AVX or SSE4.2 that have been added recently might influence the machine code generation in an unfortunate way, but ... my assembler knowledge is not as good as I'd like it to be, and thus, I can not make a definite statement here. Overall, the generated machine code looks like it was mainly reordered (that is, mainly changed structurally), but comparing the dumps manually is a pain in the ... eye (the addresses are all different, even when the instructions are largely the same).

(I wanted to dump the results of the machine code that is generated for the runTest method here, but there is some odd limit of 30k for one answer)

I'll try to further analyze and compare the machine code dumps and the hotspot code. But in the end, it will be hard to point the finger at "the" change that caused the performance degradation - in terms of machine code that executes slower, as well as in terms of the changes in hotspot that cause the change in the machine code.

Interesting. I reworked my MVCE to be something less abominably long which I'll post here in a moment, but I found that my code ran about the same in 8u25 as in 8u40 after making changes based on information I've gleaned from the replies here. It was a wee bit slower, but it was only a 5 second difference out of ~260 second execution times. Thanks for continuing to investigate this matter. — user3765373, Mar 16 '15 at 22:53

Java 8u40 Math.round() very slow

3 Answers3

MVCE based on OP

Source analysis