1

Let's say we have to perform this calculation 10000 times in a loop.

Case 1

double answer = i * 1.6712 * 1000 * 60;

Case 2

double answer = i * 100272; // 1.6712 * 1000 * 60 = 100272

where i is the loop index.

Question

What is the most efficient method (in terms on CPU cycles), case 1 or 2 and why ?

Adam Stelmaszczyk
  • 19,665
  • 4
  • 70
  • 110
TaLha Khan
  • 2,413
  • 2
  • 25
  • 39
  • 3
    I suspect that they will be same since compiler (or later JIT) can rewrite `1.6712 * 1000 * 60` as `100272.0`. – Pshemo May 31 '15 at 10:05
  • 2
    If you rewrite the first case like this: `double answer = i * (1.6712 * 1000 * 60);`, the resulting bytecode for both cases is exactly the same, since `(1.6712 * 1000 * 60)` is a compile time constant. – fabian May 31 '15 at 10:20
  • Why exactly do you care? Premature optimization is the mother of all evil! – Srini May 31 '15 at 13:30
  • 2
    @Pshemo No, it can't as the original expression is actually `(((i * 1.6712) * 1000) * 60)` and floating point multiplication is not associative. The JIT as not allowed to do it. – maaartinus May 31 '15 at 19:58
  • @maaartinus Yes, you are right. But lets just point that compiler will optimize it if we rewrite this code using fabian's suggestion like: `double answer = i * (1.6712 * 1000 * 60);`. This will make compiler create bytecode same as bytecode of `double answer = i * 100272.0;` which means it will optimize operations on compile-time constants (important detail: `1.6712 * 1000 * 60` is `double 100272.0` not `int 100272`). – Pshemo May 31 '15 at 20:21
  • @Pshemo Agreed and that's exactly what I'd do for performance (if I cared). Creating a constant with some meaningful name would be even better. – maaartinus May 31 '15 at 21:18

1 Answers1

3

Here is a JMH benchmark:

@OutputTimeUnit(TimeUnit.SECONDS)
@BenchmarkMode({ Mode.Throughput })
@Warmup(iterations = 10)
@Fork(value = 1)
@State(Scope.Benchmark)
public class MyBenchmark {
    private static final double CONSTANT = 1.6712 * 1000 * 60;
    private double x = 0;

    @Benchmark
    public void testCaseOne() {
        for (double i = 1; i < 1000_000; i++) {
            x += i * 1.6712 * 1000 * 60;
        }
    }

    @Benchmark
    public void testCaseTwo() {
        for (double i = 1; i < 1000_000; i++) {
            x += i * (1.6712 * 1000 * 60);
        }
    }

    @Benchmark
    public void testCaseThree() {
        for (double i = 1; i < 1000_000; i++) {
            x += i * 100272;
        }
    }

    @Benchmark
    public void testCaseFour() {
        final double constant = 1.6712 * 1000 * 60;
        for (double i = 1; i < 1000_000; i++) {
            x += i * constant;
        }
    }

    @Benchmark
    public void testCaseFive() {
        for (double i = 1; i < 1000_000; i++) {
            x += i * CONSTANT;
        }
    }
}

And the results:

Benchmark                   Mode  Cnt    Score    Error  Units

MyBenchmark.testCaseOne    thrpt   20  680,452 ± 15,700  ops/s
MyBenchmark.testCaseTwo    thrpt   20  721,542 ± 14,131  ops/s
MyBenchmark.testCaseThree  thrpt   20  729,411 ± 17,031  ops/s
MyBenchmark.testCaseFour   thrpt   20  735,255 ± 16,001  ops/s
MyBenchmark.testCaseFive   thrpt   20  719,481 ±  5,338  ops/s

Java version:

openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)

As you can see there is no significant difference in the throughput, so you can write it in the way it is most clear and easier to understand.


Regarding my previous benchmark results:

Benchmark                   Mode  Cnt     Score    Error  Units
MyBenchmark.testCaseOne    thrpt   20   228,285 ±  2,232  ops/s
MyBenchmark.testCaseTwo    thrpt   20   593,755 ±  8,165  ops/s
MyBenchmark.testCaseThree  thrpt   20  1035,549 ± 20,908  ops/s

The previous benchmark was broken - the counter in the for loop was of type int and in testCaseThree it was doing integer multiplication and that's the reason it was so much faster. The other results were also affected by this bug in the benchmark.

It's still interesting though why the mixed integer-double multiplication is so much slower than both pure int-int and double-double multiplication. Maybe because of type casting?

Adam Stelmaszczyk
  • 19,665
  • 4
  • 70
  • 110
Svetlin Zarev
  • 14,713
  • 4
  • 53
  • 82
  • I've repeated your benchmark and the order and ratio between test cases scores is different. I've run it again and it changed again. Do you get repeatable results? If no, can you increase the number of iterations? It may help. Why do you have `1000_000`? It doesn't compile with it... I use `1000000`. – Adam Stelmaszczyk May 31 '15 at 15:38
  • I've run it more than 10 times and I always get consistent results. The underscore is a valid digit delimiter in Java >= 7, so 1_000_000 == 1000000, but it's much easier to read. The results between the test may vary if your system is under load. Just make sure that the test is the only running application. Regarding the order - I rearranged the rows in the table to be in proper order, because jmh, executed them in random order. – Svetlin Zarev May 31 '15 at 15:47
  • 1_000_000 - that's cool, didn't know about it. About the order I meant order of the cases looking at their scores (which case was the fastest etc.), not print order, I also had to manually change the print order. Now I turned off all the other applications and increased warmup to 20, I got consistent [results](http://pastebin.com/vg0FgdY1) (with myself and you). – Adam Stelmaszczyk May 31 '15 at 16:15
  • Cases 2, 3, 4 are equally the fastest. Case 1 the slowest (very close to case 5), indeed. Why? It's a good question, maybe for separate SO thread. Could be casting. Maybe looking at generated [assembly](http://stackoverflow.com/questions/9341083/how-to-use-xxunlockdiagnosticvmoptions-xxcompilecommand-print-option-with-j) will bring the answer. – Adam Stelmaszczyk May 31 '15 at 16:15
  • 1
    @AdamStelmaszczyk That's simple. Unlike int multiplication, floating point multiplication is *not* associative, so the compiler mustn't regroup the operands and you have three multiplications instead of one. – maaartinus May 31 '15 at 18:26
  • @maaartinus Then why it's not ~3x (or even 2x or 1.5x) times slower ? I highly doubt that the addition and the assignment take more time than floating point multiplication – Svetlin Zarev May 31 '15 at 19:08
  • 1
    @AdamStelmaszczyk The CPU can do quite a few things in parallel and for the single multiplication case, the latency of the addition to `x` becomes a bottleneck slowing it down. I'd bet that with something like `x += ...; y += ...` in the loop body, the ratio would increase. See also [this question of mine](http://stackoverflow.com/q/21745619/581205) showing how a simple computation like `String.hashCode` gets slowed down by a factor of 4 due to such a bottleneck. – maaartinus May 31 '15 at 19:55