
I ran the same small benchmark in both Java and Rust.

Java:

public class Main {
    private static final int NUM_ITERS = 100;

    public static void main(String[] args) {
        long tInit = System.nanoTime();
        int c = 0;

        for (int i = 0; i < NUM_ITERS; ++i) {
            for (int j = 0; j < NUM_ITERS; ++j) {
                for (int k = 0; k < NUM_ITERS; ++k) {
                    if (i*i + j*j == k*k) {
                        ++c;
                        System.out.println(i + " " + j + " " + k);
                    }
                }
            }
        }

        System.out.println(c);
        System.out.println(System.nanoTime() - tInit);
    }
}

Rust:

use std::time::SystemTime;

const NUM_ITERS: i32 = 100;

fn main() {
    let t_init = SystemTime::now();
    let mut c = 0;

    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                    println!("{} {} {}", i, j, k);
                }
            }
        }
    }

    println!("{}", c);
    println!("{}", t_init.elapsed().unwrap().as_nanos());
}

When NUM_ITERS = 100, as expected, Rust outperformed Java:

Java: 59311348 ns
Rust: 29629242 ns

But for NUM_ITERS = 1000, Rust took much longer and Java was far faster:

Java: 1585835361 ns
Rust: 28623818145 ns

What could be the reason for this? Shouldn't Rust perform better than Java in this case too? Or is it because I have made some mistake in the implementation?

Update

I removed the lines System.out.println(i + " " + j + " " + k); and println!("{} {} {}", i, j, k); from the code. Here are the outputs:

NUM_ITERS = 100
Java: 3843114 ns
Rust: 29072345 ns

NUM_ITERS = 1000
Java: 1014829974 ns
Rust: 28402166953 ns

So, without the println statements, Java performs better than Rust in both cases. I simply want to know why that is the case. Java has the Garbage Collector running and other overheads. Have I not implemented the loops in Rust optimally?

Divyanshu Pundir
  • did you compile rust in production mode or in debug? – Netwave Apr 22 '21 at 10:03
  • Did you benchmark your code correctly? https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java – marstran Apr 22 '21 at 10:05
  • There is no such thing as a benchmark that produces output per iteration. You are measuring I/O, not CPU time. – user207421 Apr 22 '21 at 10:12
  • As noted by the above comment, printing in a loop is not a good way to measure language performance. But specifically for rust, all calls to `println!` acquire a lock. You could optimize this by explicitly acquiring the lock once https://rust-cli.github.io/book/tutorial/output.html#a-note-on-printing-performance . The JVM JIT is smart enough to recognise the inefficient repeated-locking and perform that optimization automatically. – MikeFHay Apr 22 '21 at 10:44
  • @MikeFHay, I ran the code without the `println!` in the loop too, and I got similar results. – Divyanshu Pundir Apr 22 '21 at 10:50
  • @Netwave, these are just simple programs that I ran on intelliJ. – Divyanshu Pundir Apr 22 '21 at 10:51
  • It's technically not a benchmark. It's just a simple raw performance comparison. – Divyanshu Pundir Apr 22 '21 at 10:54
  • I think it wasn't appropriate to close this question. OP specifically asked if there was a mistake in the benchmark code, and they wrote a clear example that could easily be fixed. The question was ready for a specific answer addressing those issues and providing results after fixes. – Marko Topolnik Apr 22 '21 at 11:31
  • Although it's definitely not good to print anything at all during the measurement, I think the commenters above missed the fact that the condition `a² + b² = c²` is satisfied just 299 times out of 1 million tested combinations (with NUM_ITERS = 100). – Marko Topolnik Apr 22 '21 at 11:37
  • I agree with @MarkoTopolnik. The question has enough details and clarity, although a more suitable title might be better. – Deadbeef Apr 22 '21 at 11:58
  • With the Java example, you are mostly measuring first-time initialization overhead: the start in interpreted execution mode, JIT activity, etc. When you run the method multiple times within the same runtime, the time will drop significantly. Then, the smallest execution times will exhibit the expected 10³ scaling factor when changing `NUM_ITERS` by a factor of ten. Since the Rust execution does already exhibit the 10³ scaling factor, it might be the final state without other distorting factors, but that would imply that it is horribly slow. – Holger Apr 22 '21 at 11:58
  • Yes, @MarkoTopolnik is correct. And for `NUM_ITERS = 1000` it was printing `3755` in 1 billion iterations. – Divyanshu Pundir Apr 22 '21 at 12:00
  • @DivyanshuPundir, in IntelliJ, run with `cargo run --release` to run optimized code. – Deadbeef Apr 22 '21 at 12:01
  • @Holger, Rust claims to be a system programming language with performance similar to C/C++. And Java has the Garbage Collector running (which is missing in Rust). Shouldn't it make sense for Rust to be faster for a simple number-crunching program like this? – Divyanshu Pundir Apr 22 '21 at 12:05
  • As @Deadbeef suggested, I ran the program using `cargo run --release` for `NUM_ITERS = 1000`. Now the results are `Java: 990470026 ns` `Rust: 542113378 ns` Looks like Rust does some serious optimisations in its release mode. – Divyanshu Pundir Apr 22 '21 at 12:09
  • I just used an improved version (which I'd be more than happy to share in an answer, if reopened); the result with the optimized Rust build is 0.6 ns per iteration and for Java it's 0.32 ns per iteration. I don't find this a surprising result: there's no allocation here so GC doesn't matter, and Java's JIT compiler is very good at optimizing simple code like this. – Marko Topolnik Apr 22 '21 at 12:09
  • @MarkoTopolnik I already voted to reopen. Afaik, there’s only one more vote needed… – Holger Apr 22 '21 at 12:28
  • @DivyanshuPundir 542113378 ns is the order of magnitude I get with Java too, when I run this multiple times within the same JVM. So, the higher numbers are the expected startup issues of an environment with dynamic compilation. – Holger Apr 22 '21 at 12:30
  • You want to read https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java, for example. – GhostCat Apr 22 '21 at 13:17

2 Answers


I adjusted your code to eliminate the points of criticism laid out in the comments. Not compiling Rust for production is the biggest problem; that alone introduces a 50x overhead. Beyond that, I eliminated printing while measuring, and properly warmed up the Java code.

I would say that Java and Rust were on par after these changes: they are within 2x of each other, and both have a very low cost per iteration (just a fraction of a nanosecond).

Here is my code:

public class Testing {
    private static final int NUM_ITERS = 1_000;
    private static final int MEASURE_TIMES = 7;

    public static void main(String[] args) {
        for (int i = 0; i < MEASURE_TIMES; i++) {
            System.out.format("%.2f ns per iteration%n", benchmark());
        }
    }

    private static double benchmark() {
        long tInit = System.nanoTime();
        int c = 0;
        for (int i = 0; i < NUM_ITERS; ++i) {
            for (int j = 0; j < NUM_ITERS; ++j) {
                for (int k = 0; k < NUM_ITERS; ++k) {
                    if (i*i + j*j == k*k) {
                        ++c;
                    }
                }
            }
        }
        if (c % 137 == 0) {
            // Use c so its computation can't be elided
            System.out.println("Count is divisible by 137: " + c);
        }
        long tookNanos = System.nanoTime() - tInit;
        return tookNanos / ((double) NUM_ITERS * NUM_ITERS * NUM_ITERS);
    }
}

Rust:

use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;

fn main() {
    let mut c = 0;

    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    let took_ns = t_init.elapsed().unwrap().as_nanos() as f64;

    let iters = NUM_ITERS as f64;
    println!("{} ns per iteration", took_ns / (iters * iters * iters));
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
}

I run Java from IntelliJ, with JDK 16. I run Rust from the command line, using `cargo run --release`.

Example of Java output:

0.98 ns per iteration
0.93 ns per iteration
0.32 ns per iteration
0.34 ns per iteration
0.32 ns per iteration
0.33 ns per iteration
0.32 ns per iteration

Example of Rust output:

0.600314 ns per iteration

While I'm not necessarily surprised to see Java giving a better result (its JIT compiler has been optimized for 20 years now and there's no object allocation, so no GC), I was puzzled at the overall low cost of an iteration. We can assume the expression i*i + j*j to be hoisted out of the inner loop, which leaves just k*k inside it.
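As a sketch of what that hoisting would look like if done by hand (a hypothetical manual transformation for illustration, not the compiler's actual output; I use NUM_ITERS = 100 here just so the example runs quickly):

```rust
// Manual loop-invariant hoisting: i*i is computed once per i, and
// i*i + j*j once per j, leaving only k*k and a comparison in the
// innermost loop.
const NUM_ITERS: i32 = 100;

fn count_triples() -> i32 {
    let mut c = 0;
    for i in 0..NUM_ITERS {
        let ii = i * i; // invariant across the j and k loops
        for j in 0..NUM_ITERS {
            let lhs = ii + j * j; // invariant across the k loop
            for k in 0..NUM_ITERS {
                if lhs == k * k {
                    c += 1;
                }
            }
        }
    }
    c
}

fn main() {
    // 299 matches out of a million combinations, as noted in the comments
    println!("{}", count_triples());
}
```

This computes exactly the same count as the original triple loop; the optimizer is almost certainly doing the equivalent transformation on its own.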

I used a disassembler to check out the code Rust produced. It definitely involves IMUL in the innermost loop. I read this answer, which says Intel has a latency of just 3 CPU cycles for an IMUL instruction. Combine that with multiple ALUs and instruction parallelism, and the result of 1 cycle per iteration becomes more plausible.

Another interesting thing I discovered is that if I just check c % 137 == 0 but don't print the actual value of c in the Rust println! statement (printing only "Count is divisible by 137"), the iteration cost drops to just 0.26 ns. So Rust was able to eliminate a lot of work from the loop when I didn't ask for the exact value of c.
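My reconstruction of that faster variant looks like this (a sketch of the shape of the change, not the exact code I measured): the branch still depends on `c`, but its exact value never reaches any output, which gives the optimizer more room to eliminate work.

```rust
use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;

fn count_triples() -> i32 {
    let mut c = 0;
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i * i + j * j == k * k {
                    c += 1;
                }
            }
        }
    }
    c
}

fn main() {
    let t_init = SystemTime::now();
    let c = count_triples();
    // Only a property of c is observed, never its exact value,
    // so the optimizer may simplify the counting work.
    if c % 137 == 0 {
        println!("Count is divisible by 137");
    }
    println!("{} ns elapsed", t_init.elapsed().unwrap().as_nanos());
}
```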


UPDATE

As discussed in the comments with @trentcl, I mimicked the Java code more completely, adding an outer loop that repeats the measurement, which is now in a separate function:

use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;
const MEASURE_TIMES: i32 = 7;

fn main() {
    let total_iters: f64 = NUM_ITERS as f64 * NUM_ITERS as f64 * NUM_ITERS as f64;
    for _ in 0..MEASURE_TIMES {
        let took_ns = benchmark() as f64;
        println!("{} ns per iteration", took_ns / total_iters);
    }
}

fn benchmark() -> u128 {
    let mut c = 0;

    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
    return t_init.elapsed().unwrap().as_nanos();
}

Now I'm getting this output:

0.781475 ns per iteration
0.760657 ns per iteration
0.783821 ns per iteration
0.777313 ns per iteration
0.766473 ns per iteration
0.774042 ns per iteration
0.766718 ns per iteration

This is another subtle change to the code that resulted in a significant change in performance. However, it also shows a key advantage of Rust over Java: there is no warmup needed to reach optimum performance.

Marko Topolnik
  • Running both of those programs on my machine yields 0.38 ns for Java and 0.10 ns for Rust as the best times per iteration. Consider adding the rustc flag `-C target-cpu=native` to let the compiler use the host machine's full instruction set. Also, which JVM are you using? The JIT compiler may be good, but we're testing it against LLVM's ahead-of-time compilation, which is expected to be more aggressive in general. – E_net4 Apr 22 '21 at 13:28
  • @E_net4thecopycat I found one surprising result: I actually changed the print statement in Rust at the last moment without rerunning the test. Changing from `println!("Count is divisible by 137")` to `println!("Count is divisible by 137: {}", c)` results in a difference of 0.26 ns vs. the posted 0.60 ns! – Marko Topolnik Apr 22 '21 at 13:32
  • @MarkoTopolnik Yes, the results may be a bit surprising like that. For microbenchmarking in Rust, one would often use a statistics-driven harness such as [Criterion.rs](https://bheisler.github.io/criterion.rs/book/index.html), which provides an implementation of `black_box()`, a construct designed to avoid optimizing values out (otherwise, the `black_box` in the std library requires a nightly toolchain at the moment). – E_net4 Apr 22 '21 at 13:37
  • @E_net4thecopycat I played with the target CPU flag, more surprising results. When printing `c`, it makes the performance worse (0.67 ns), but when not printing it, it makes it even better than what you posted: a mere 0.076 ns per iteration! There must be some shortcuts the compiler is taking. – Marko Topolnik Apr 22 '21 at 13:51
  • Granted, many shortcuts are possible here. I would suspect that the JVM is taking shortcuts too, with the last 5 runs being consistently faster than the first two. Some kind of memoization perhaps? The low iteration times in general are possibly caused by loop unrolling for the employment of SIMD instructions. – E_net4 Apr 22 '21 at 14:03
  • @E_net4thecopycat The first two runs are slower because the code hasn't been fully JIT-compiled yet. This is always expected on the JVM. I doubt there's memoization, after all this is still 0.3 ns * 1,000,000,000 iterations, the runtime is non-trivial. I was thinking SIMD as well... if I had a disassembler set up, I'd check it out. – Marko Topolnik Apr 22 '21 at 14:07
  • Note that hardware also has "warming up" effects -- frequency scaling and adaptive branch prediction come to mind; there are others. In this case I wouldn't expect any of them to be significant, but to perform a fair comparison you should run the Rust version multiple times just as you do the JVM version, to show that the per-iteration cost *doesn't* decrease over time. – trent Apr 22 '21 at 15:37
  • @trentcl Adaptive branch prediction works at the nanosecond scale since you need just two passes for it to get fully biased in one direction. There is a significant effect _opposite_ to frequency scaling: thermal throttling, which is something i routinely observe when benchmarking on the laptop. This works at the second scale and it's not very significant with single-threaded code. As for running several times, it's such an instinct that I don't even see the result as "fully materialized" until it's at least 3 times about the same. But I can add a loop to Rust as well. – Marko Topolnik Apr 22 '21 at 15:44
  • As I said: I wouldn't expect any of them to be significant in this case. But microbenchmarks are unreliable enough without adding more unnecessary variables. – trent Apr 22 '21 at 15:53
  • @MarkoTopolnik, I implemented your version of the Java code in Rust and got almost a consistent `0.71 ns` per iteration. Looks like there's no warmup effect in Rust. But, I got an even more shocking result when I removed the `println!` with `c`. `0.000000076 ns` for the first iteration and `~0.000000043 ns` for the other 6. I did get the warning that said `variable 'c' is assigned to, but never used`. Could it be that Rust sees that the variable isn't being used and bypasses the computation involving it? – Divyanshu Pundir Apr 22 '21 at 16:07
  • @DivyanshuPundir Yes, that's what my comment in the code explains: `Use c to ensure its computation can't be elided by the optimizer` – Marko Topolnik Apr 22 '21 at 16:11
  • What about GraalVM, running Java as a native binary? That would remove any kind of warmup and also need much less memory. – kensai Jul 01 '22 at 04:57
  • @kensai Yeah, that exists, but it has many problems because Java wasn't meant to be a statically-compiled language. There are projects like Quarkus whose sole business is modifying popular libraries to be GraalVM-compatible. – Marko Topolnik Jul 01 '22 at 09:10

The Java VM possibly just-in-time compiles your inner loop to native assembly code after it exceeds a certain number of iterations. I am not a JVM expert, but this could explain why a larger number of iterations magically makes the Java code run faster.

The technology behind this is called HotSpot, and as far as I know it works differently on "client" and "server" VMs. Whether it is available, and how it behaves, also depends on the vendor of the JVM.

https://en.wikipedia.org/wiki/HotSpot_(virtual_machine)

Peter Ertl