Why is this Rust program so slow? Did I miss something?

Question

I read Minimal distance in Manhattan metric and rewrote the author's "naive" implementation in Rust. The C++ variant is:

#include <utility>
#include <cstdio>
#include <cstdlib>

std::pair<int, int> pointsA[1000001];
std::pair<int, int> pointsB[1000001];

int main() {
    int n, t;
    unsigned long long dist;

    scanf("%d", &t);

    while(t-->0) {
        dist = 4000000000LL;
        scanf("%d", &n);

        for(int i = 0; i < n; i++) {
            scanf("%d%d", &pointsA[i].first, &pointsA[i].second);
        }

        for(int i = 0; i < n; i++) {
            scanf("%d%d", &pointsB[i].first, &pointsB[i].second);
        }

        for(int i = 0; i < n ;i++) {
            for(int j = 0; j < n ; j++) {
                if(abs(pointsA[i].first - pointsB[j].first) + abs(pointsA[i].second - pointsB[j].second) < dist)
                    dist = abs(pointsA[i].first - pointsB[j].first) + abs(pointsA[i].second - pointsB[j].second);
            }
        }
        printf("%lld\n", dist);
    }
}

The Rust variant is:

use std::io;
use std::io::BufReader;
use std::io::BufRead;

fn read_array(stdin: &mut BufReader<io::Stdin>, array_len: usize, points: &mut Vec<(i32, i32)>) {
    let mut line = String::new();
    for _ in 0..array_len {
        line.clear();
        stdin.read_line(&mut line).unwrap();
        let mut item = line.split_whitespace();
        let x = item.next().unwrap().parse().unwrap();
        let y = item.next().unwrap().parse().unwrap();
        points.push((x, y));
    }
}

fn manhattan_dist(a: &(i32, i32), b: &(i32, i32)) -> u32 {
    ((a.0 - b.0).abs() + (a.1 - b.1).abs()) as u32
}

fn main() {
    let mut line = String::new();
    let mut stdin = BufReader::new(io::stdin());
    stdin.read_line(&mut line).unwrap();
    let n_iters = line.trim_right().parse::<usize>().unwrap();
    let mut points_a = Vec::with_capacity(10000);
    let mut points_b = Vec::with_capacity(10000);
    for _ in 0..n_iters {
        line.clear();
        stdin.read_line(&mut line).unwrap();
        let set_len = line.trim_right().parse::<usize>().unwrap();
        points_a.clear();
        points_b.clear();
        read_array(&mut stdin, set_len, &mut points_a);
        read_array(&mut stdin, set_len, &mut points_b);
        let mut dist = u32::max_value();
        for i in points_a.iter() {
            for j in points_b.iter() {
                dist = std::cmp::min(manhattan_dist(i, j), dist);
            }
        }
        println!("{}", dist);
    }
}

Then, I generated data with a Python script:

import random

ITER = 100
N = 10000
MAX_INT = 1000000

print("%d" % ITER)

for _ in range(0, ITER):
    print("%d" % N)
    for _ in range(0, N):
        print(random.randrange(-MAX_INT, MAX_INT + 1), random.randrange(1, MAX_INT + 1))
    for _ in range(0, N):
        print(random.randrange(-MAX_INT, MAX_INT + 1), random.randrange(-MAX_INT, 0))

And compiled both variants with g++ -Ofast -march=native and rustc -C opt-level=3 respectively. The timings are:

C++

real    0m7.789s
user    0m7.760s
sys     0m0.020s

Rust

real    0m28.589s
user    0m28.570s
sys     0m0.010s

Why is my Rust code four times slower than the C++ variant? I am using Rust 1.12.0-beta.1.

I added time measurements:

let now = SystemTime::now();
line.clear();
stdin.read_line(&mut line).unwrap();
let set_len = line.trim_right().parse::<usize>().unwrap();
points_a.clear();
points_b.clear();
read_array(&mut stdin, set_len, &mut points_a);
read_array(&mut stdin, set_len, &mut points_b);
io_time += now.elapsed().unwrap();

let now = SystemTime::now();
let mut dist = u32::max_value();
for i in points_a.iter() {
    for j in points_b.iter() {
        dist = std::cmp::min(manhattan_dist(i, j), dist);
    }
}
calc_time += now.elapsed().unwrap();

And writeln!(&mut std::io::stderr(), "io_time: {}, calc_time: {}", io_time.as_secs(), calc_time.as_secs()).unwrap(); prints io_time: 0, calc_time: 27.

I tried nightly rustc 1.13.0-nightly (e9bc1bac8 2016-08-24):

$ time ./test_rust < data.txt  > test3_res
io_time: 0, calc_time: 19

real    0m19.592s
user    0m19.560s
sys     0m0.020s
$ time ./test1 < data.txt  > test1_res

real    0m7.797s
user    0m7.780s
sys     0m0.010s

So it is at now ~ 2.7x difference on my Core i7.

The issue is most certainly that none of your implementations are equivalent. Write the Rust code exactly as the C++ version. Take a handle to stdout and stdin in the beginning of the application and lock them. Write directly to stdout's buffer instead of using the macro which incurs lock+formatting overhead. — , Aug 28 '16 at 03:23
Try building with `env RUSTFLAGS="-C target-cpu=native" cargo build --release`. The Rust compiler refuses to use various higher end CPU extensions without specifically enabling them. — , Aug 28 '16 at 03:29
FWIW, `BufReader` is not ideal usage of `stdin`; try using `stdin.lock()` instead, which provides `BufRead` for you and avoids repeated locking. The difference isn't really meaningful, though, since IO isn't a large cost here. — Veedrac, Aug 28 '16 at 10:13
@user1244932: I invite your to read the reddit discussion [here](https://www.reddit.com/r/rust/comments/4zx55q/xpost_stackoverflow_rust_version_is_4x_slower_as/), there were a number of comments that I personally found quite interesting and that are a bit too long to succinctly summarize :) — Matthieu M., Aug 28 '16 at 15:52

Veedrac · Accepted Answer · 2018-03-30T01:36:30.947

The difference is of course -march=native... kind of. Rust has this through -C target_cpu=native, but this doesn't give the same speed benefit. This is because LLVM is unwilling to vectorize in this context, whereas GCC does. You may note that using Clang, a C++ compiler that also uses LLVM, also produces relatively slow code.

To encourage LLVM to vectorize, you can move the main loop into a separate function. Alternatively, you can use a local block. If you write the code carefully as

let dist = {
    let mut dist = i32::max_value();
    for &(a, b) in &points_a[..n] {
        for &(c, d) in &points_b[..n] {
            dist = std::cmp::min(((a - c).abs() + (b - d).abs()), dist);
        }
    }
    dist
} as u32;

the difference between Rust and C++ is then near-negligible (~4%).

score 27 · Answer 2 · edited Aug 28 '16 at 17:05

27

The vast majority of the performance you're seeing in C++ is due to the flag -march=native.

This flag is not the equivalent flag to Rust's --release. It uses CPU instructions specific to the CPU it is compiled on, so math in particular is going to be way faster.

Removing that flag puts the C++ code at 19 seconds.

Then there's the unsafety present in the C++ code. None of the input is checked. The Rust code does check it, you use .unwrap() – unwrap has a performance cost, there's an assertion, then the code necessary for unwinding, etc.

Using if lets instead of raw unwraps, or ignoring results where possible, brings the Rust code down again.

Rust: 22 seconds

C++: 19 seconds

Where's the 3 seconds coming from? A bit of playing around leads me to believe it's println! vs. printf, but I don't have hard numbers for the C++ code. What I can say is that the Rust code drops to 13 seconds when I perform the printing outside of the benchmark.

TLDR: Your compiler flags are different, and your C++ code is not safe.

edited Aug 28 '16 at 17:05

Ry-

218,210
55
464
476

answered Aug 28 '16 at 02:04

IBit

380
2
9

1

`-march=native` is equivalent to `-C target-cpu=native`. Here are the times I get: C++: 12.417s; C++ with `-march=native`: 9.005s; Rust: 13.971s; Rust with `-C target-cpu=native`: 11.943s. – Yohaï-Eliel Berreby Aug 28 '16 at 08:02
First of all it is not my `c++` code, I provided link in question where it came from. – user1244932 Aug 28 '16 at 11:59
Second I doubt that your finding about `println!` vs `printf` is correct. I implement `fast` variant of this algorithm with `O(N*log(N))` instead of `O(N * N)`, but leave input/output code `as is` and now result `0.7 seconds`, similar to what `fast c++` show. So input/output can not give more then `0.1 seconds` – user1244932 Aug 28 '16 at 12:02

score 2 · Answer 3 · edited Aug 29 '16 at 03:26

2

I'm definitely not seeing any difference in execution time. On my machine,

C++:

real    0m19.672s
user    0m19.636s
sys     0m0.060s

Rust:

real    0m19.047s
user    0m19.028s
sys     0m0.040s

I compiled the Rust code with rustc -O test.rs -o ./test and the C++ code with g++ -Ofast test.cpp -o test.

I'm running Ubuntu 16.04 with Linux Kernel 4.6.3-040603-generic. The laptop I ran this on has an Intel(R) Core(TM) i5-6200U CPU and 8GB of RAM, nothing special.

edited Aug 29 '16 at 03:26

Shepmaster

388,571
95
1,107
1,366

answered Aug 28 '16 at 01:21

coder543

803
6
16

What versions do you use and how you run? I use rustc (1.12.0-beta.1) and gcc 5.3.0, and run with `time ./exe < data > /tmp/out` – user1244932 Aug 28 '16 at 01:27
rustc 1.13.0-nightly (e9bc1bac8 2016-08-24) – coder543 Aug 28 '16 at 01:29
Also I run `cd /sys/devices/system/cpu/cpufreq/ && for i in `seq 0 7`; do echo performance > policy$i/scaling_governor; done` before timing measuring, did you do the same? – user1244932 Aug 28 '16 at 01:29
and g++ (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609 – coder543 Aug 28 '16 at 01:30
No, I didn't mess with my governor, but the governor would react within milliseconds, and the benchmark takes tens of seconds. At best, I would expect governor settings to make a few tenths of a second difference at best. – coder543 Aug 28 '16 at 01:31
Huh. Running it with `-march=native` sped the C++ code up to the 6.7 seconds. – coder543 Aug 28 '16 at 01:34
I rerun with `nightly`, instead of `beta` performance difference decrease from 4x to 2.7x – user1244932 Aug 28 '16 at 01:35
This looks like a very oddly specific set of circumstances that are allowing `native` to do real optimization. I've never heard of `-march=native` actually making such a difference before, and this only works with `g++`. The effect isn't nearly as dramatic with `clang++`, which shares the same backend as `rustc`. – coder543 Aug 28 '16 at 01:36
I saw such effect with `icc` vs `gcc`. On simple matrix multiplication (1000x1000) icc speedup code by 10x, because of it can in such case do vectorization, while `gcc` can not. – user1244932 Aug 28 '16 at 01:40

Why is this Rust program so slow? Did I miss something?

3 Answers3