I adjusted your code to address the points of criticism raised in the comments. The biggest problem was not compiling the Rust code with optimizations for release, which introduces roughly a 50x overhead. Beyond that, I eliminated printing while measuring and did proper warming up of the Java code.
I would say that Java and Rust are on par after these changes: they are within 2x of each other, and both have a very low cost per iteration (just a fraction of a nanosecond).
Here is my code:
public class Testing {
    private static final int NUM_ITERS = 1_000;
    private static final int MEASURE_TIMES = 7;

    public static void main(String[] args) {
        for (int i = 0; i < MEASURE_TIMES; i++) {
            System.out.format("%.2f ns per iteration%n", benchmark());
        }
    }

    private static double benchmark() {
        long tInit = System.nanoTime();
        int c = 0;
        for (int i = 0; i < NUM_ITERS; ++i) {
            for (int j = 0; j < NUM_ITERS; ++j) {
                for (int k = 0; k < NUM_ITERS; ++k) {
                    if (i*i + j*j == k*k) {
                        ++c;
                    }
                }
            }
        }
        if (c % 137 == 0) {
            // Use c so its computation can't be elided
            System.out.println("Count is divisible by 137: " + c);
        }
        long tookNanos = System.nanoTime() - tInit;
        return tookNanos / ((double) NUM_ITERS * NUM_ITERS * NUM_ITERS);
    }
}
use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;

fn main() {
    let mut c = 0;
    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    let took_ns = t_init.elapsed().unwrap().as_nanos() as f64;
    let iters = NUM_ITERS as f64;
    println!("{} ns per iteration", took_ns / (iters * iters * iters));
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
}
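A side note on the c % 137 trick: it is only there to make c observable so the loop can't be optimized out. On toolchains where std::hint::black_box is available on stable (Rust 1.66+, so not necessarily the version these timings came from), the same effect can be expressed more directly. A minimal sketch, not the code I measured:

use std::hint::black_box;
use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;

fn main() {
    let mut c = 0;
    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    let took_ns = t_init.elapsed().unwrap().as_nanos() as f64;
    let iters = NUM_ITERS as f64;
    println!("{} ns per iteration", took_ns / (iters * iters * iters));
    // black_box is a hint that the value is observed,
    // so the counting work can't be elided entirely.
    black_box(c);
}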
I run Java from IntelliJ, with JDK 16. I run Rust from the command line, using cargo run --release.
Example of Java output:
0.98 ns per iteration
0.93 ns per iteration
0.32 ns per iteration
0.34 ns per iteration
0.32 ns per iteration
0.33 ns per iteration
0.32 ns per iteration
Example of Rust output:
0.600314 ns per iteration
While I'm not necessarily surprised to see Java giving a better result (its JIT compiler has been optimized for 20 years now and there's no object allocation, so no GC), I was puzzled at the overall low cost of an iteration. We can assume the expression i*i + j*j to be hoisted out of the inner loop, which leaves just k*k inside it.
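To make the hoisting concrete, the compiler effectively turns the loop nest into something like the following hand-hoisted equivalent (a sketch of the transformation, not the code I actually measured or disassembled):

const NUM_ITERS: i32 = 1000;

fn main() {
    // i*i + j*j is computed once per (i, j) pair,
    // so only k*k remains in the innermost loop.
    let mut c = 0;
    for i in 0..NUM_ITERS {
        let ii = i * i;
        for j in 0..NUM_ITERS {
            let target = ii + j * j; // hoisted out of the k loop
            for k in 0..NUM_ITERS {
                if target == k * k {
                    c += 1;
                }
            }
        }
    }
    println!("count = {}", c);
}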
I used a disassembler to check the code Rust produced. It definitely involves IMUL in the innermost loop. I read this answer, which says an IMUL instruction has a latency of just 3 cycles on Intel CPUs. Combine that with multiple ALUs and instruction-level parallelism, and the result of 1 cycle per iteration becomes more plausible.
Another interesting thing I discovered is that, if I just check c % 137 == 0 but don't print the actual value of c in the Rust println! statement (only print "Count is divisible by 137"), the iteration cost drops to just 0.26 ns. So Rust was able to eliminate a lot of work from the loop when I didn't ask for the exact value of c.
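For reference, this is the variant I mean; only the final println! changes, so c is still computed but its exact value is never printed:

use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;

fn main() {
    let mut c = 0;
    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    let took_ns = t_init.elapsed().unwrap().as_nanos() as f64;
    let iters = NUM_ITERS as f64;
    println!("{} ns per iteration", took_ns / (iters * iters * iters));
    // Only the divisibility of c is observed here, not its value,
    // which lets the optimizer drop much of the counting work.
    if c % 137 == 0 {
        println!("Count is divisible by 137");
    }
}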
UPDATE
As discussed in the comments with @trentci, I mimicked the Java code more completely, adding an outer loop that repeats the measurement, which is now in a separate function:
use std::time::SystemTime;

const NUM_ITERS: i32 = 1000;
const MEASURE_TIMES: i32 = 7;

fn main() {
    let total_iters: f64 = NUM_ITERS as f64 * NUM_ITERS as f64 * NUM_ITERS as f64;
    for _ in 0..MEASURE_TIMES {
        let took_ns = benchmark() as f64;
        println!("{} ns per iteration", took_ns / total_iters);
    }
}

fn benchmark() -> u128 {
    let mut c = 0;
    let t_init = SystemTime::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
    return t_init.elapsed().unwrap().as_nanos();
}
Now I'm getting this output:
0.781475 ns per iteration
0.760657 ns per iteration
0.783821 ns per iteration
0.777313 ns per iteration
0.766473 ns per iteration
0.774042 ns per iteration
0.766718 ns per iteration
Another subtle change to the code resulted in a significant change in performance. However, it also shows a key advantage of Rust over Java: no warmup is needed to reach optimal performance.
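One last note on the measurement itself: SystemTime is the wall clock and can be adjusted while the benchmark runs, whereas std::time::Instant is the monotonic clock intended for measuring elapsed time. Switching to it is a small change; a minimal sketch of the benchmark function with that substitution (the timings above were taken with SystemTime):

use std::time::Instant;

const NUM_ITERS: i32 = 1000;

fn benchmark() -> u128 {
    let mut c = 0;
    // Instant is monotonic, so the elapsed time can't go backwards
    // if the system clock is adjusted mid-measurement.
    let t_init = Instant::now();
    for i in 0..NUM_ITERS {
        for j in 0..NUM_ITERS {
            for k in 0..NUM_ITERS {
                if i*i + j*j == k*k {
                    c += 1;
                }
            }
        }
    }
    // Use c to ensure its computation can't be elided by the optimizer
    if c % 137 == 0 {
        println!("Count is divisible by 137: {}", c);
    }
    t_init.elapsed().as_nanos()
}

fn main() {
    let total_iters = NUM_ITERS as f64 * NUM_ITERS as f64 * NUM_ITERS as f64;
    println!("{} ns per iteration", benchmark() as f64 / total_iters);
}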