
Here's a simple n-th prime program written in Rust. The implementation is inefficient, but that's not the point.

use std::time::Instant;

fn main() {
    let n = 10_000;
    let now = Instant::now();
    let nth = nth_prime(n);
    let elapsed = now.elapsed().as_micros();
    println!("prime #{} = {}", n, nth);
    println!("elapsed = {} µs", elapsed);
}

fn nth_prime(n: u64) -> u64 {
    let mut count = 1;
    let mut num = 2;
    while count < n {
        num += 1;
        if is_prime(num) {
            count += 1;
        }
    }
    num
}

fn is_prime(num: u64) -> bool {
    for i in 2..num {
        if num % i == 0 {
            return false;
        }
    }
    true
}

If I run this program on macOS natively with `cargo run --release`, I get a consistent execution time of ~3.4 seconds. However, if I run this program with Docker Desktop (on the same laptop), I get a consistent execution time of ~1.7 seconds. Here's the Dockerfile I'm using:

FROM rust:1.50
WORKDIR /usr/src/playground
COPY Cargo.toml .
COPY Cargo.lock .
COPY src src
RUN cargo build --release
CMD cargo run --release
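
For reference, I build and run the image with something like this (the image tag is arbitrary):

docker build -t playground .
docker run --rm playground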

However, if I switch all the `u64`s for `u32`s, then the native executable runs consistently in ~1.5 seconds, while the Docker version runs consistently in ~1.6 seconds.
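
For example, the `u32` version of the primality check looks like this (the only change is the integer type; `nth_prime` changes the same way):

fn is_prime(num: u32) -> bool {
    for i in 2..num {
        if num % i == 0 {
            return false;
        }
    }
    true
}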

Is the Linux executable optimized in a way that the macOS executable isn't?

Note that I get the same behaviour on macOS Big Sur Version 11.2.2 and on macOS Catalina Version 10.15.7. I'm using Docker Desktop Version 3.1.0 with Engine Version 20.10.2 on both laptops, and I'm using Cargo Version 1.50.0 on both laptops.

Also, note that I tried the same benchmark on an Ubuntu 18.04 Desktop. Both the native Linux executable (`cargo run --release`) and the Docker version (`docker run`) run consistently in ~1.02 seconds for the `u64` program.

Thanks!

  • Is it your suspicion that this is caused by the [next question you asked](https://stackoverflow.com/questions/66432632/why-is-docker-producing-different-images-when-building-my-rust-program-on-macos)? – kmdreko Mar 02 '21 at 03:34
  • I wish I knew! At a glance, it appears unrelated: a performance issue here versus an inconsistent build issue in my other question. But who knows if it's due to the same underlying issue with Docker Desktop... If only Docker was written in Rust I could inspect it myself :) – Félix Poulin-Bélanger Mar 02 '21 at 03:44
  • For what it's worth (likely nothing), runtime for both WSL2 and Docker (WSL2 backend) is ~1.76s after running it several times. This info is likely irrelevant though since I understand the Docker backends for mac vs win are relatively different. – cam Mar 02 '21 at 04:37
  • What's the target triple in each environment? Eg. what's the result of `rustc --print target-libdir`? – Jmb Mar 02 '21 at 07:40
  • Can you prefix `cargo run` with `time` on both invocations so we can see if it actually takes more CPU time? – HHK Mar 02 '21 at 12:25
  • @Jmb: the output of that command with the local installation of `rustc` on my machine is `/Users/felix-pb/.rustup/toolchains/stable-x86_64-apple-darwin/lib/rustlib/x86_64-apple-darwin/lib` while the output of that command if I run it in the docker container (still on the same laptop though) is `/usr/local/rustup/toolchains/1.50.0-x86_64-unknown-linux-gnu/lib/rustlib/x86_64-unknown-linux-gnu/lib`. – Félix Poulin-Bélanger Mar 02 '21 at 14:54
  • @HHK: prefixing with time on my local installation of cargo, I consistently get an output similar to `3.53s user 0.02s` while in docker (still on the same laptop though), I consistently get an output similar to `1.90user 0.01system` – Félix Poulin-Bélanger Mar 02 '21 at 15:00
  • Does it scale? What do you get for `n = 15_000`? – HHK Mar 02 '21 at 17:27
  • @HHK: For `n = 15_000`, native runs in ~8.0s while docker runs in ~4.1s. For `n = 20_000`, native runs in ~15.1s while docker runs in ~7.6s. The execution time doesn't scale linearly with `n` because of the prime distribution, but the native/docker ratio seems to be just under 2x for different values of `n`. – Félix Poulin-Bélanger Mar 03 '21 at 01:15
  • Hmm, the benchmarked code does not seem to cause any syscalls, so maybe less efficient machine code is generated for/on macOS for whatever reason. You could try to have rustc emit assembly as described [here](https://stackoverflow.com/questions/39219961/how-to-get-assembly-output-from-building-with-cargo) and check if there is anything obvious standing out (or post it on a pastebin site). – HHK Mar 03 '21 at 05:42
  • @HHK: I commented out the print statements and the time measurement statements in the main function to get rid of all syscalls. The `.s` output for native macOS is here: https://pastebin.com/wKZhNbit. And the `.s` output generated inside docker (which I `docker cp`'ed) is here: https://pastebin.com/MTUPUtrx. Unfortunately, I'm not an assembly guy, so it's gibberish to me. – Félix Poulin-Bélanger Mar 04 '21 at 01:45
  • @FélixPoulin-Bélanger from a rough look (I used to program in assembly some time ago), it looks like the code generated for macOS might use a few more memory-related instructions and operations. I would add an `assembly` tag so a pro can take a look, but the edit queue is unfortunately full – Noam Yizraeli Sep 10 '21 at 18:37
  • This question is pretty old, but I can say that I was not able to reproduce this on an M1 Mac: the running times in Docker and locally are roughly the same, local 320472 µs and Docker 316413 µs – Mahdi Dibaiee Oct 28 '22 at 04:49
  • TL;DR: tried to repro in a macOS VM, did not see the slowness. For lack of a physical x86 Mac, I tried reproducing this on a VM running Mojave, but with no luck. Using the same (1.50) toolchain, the timing I get on macOS is `elapsed = 1643338 µs`. The results I get running natively on my Ubuntu host, and in a Docker container on that host, are basically the same. – chmod777 Aug 25 '23 at 06:16

1 Answer


macOS doesn't really run Docker natively: when you run Docker, you are also running a Linux virtual machine that hosts the containers. That VM is probably using the stable-x86_64-unknown-linux-gnu target (because that is what Docker images expect), while your native host is some other target.

You can run `rustup show` to see what your active toolchain is on the native host. My guess is that it is a rather generic Mac toolchain, chosen to be compatible with a broad range of hardware, while your actual Mac is probably rather modern, so your hardware likely supports a bunch of optimizations that the LLVM optimizer (the compiler backend that Rust uses) doesn't "dare" to enable for you by default.
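
For instance, the host target triple and active toolchain can be checked with something like this (either command should work with a standard rustup installation):

rustup show
rustc --version --verbose   # the "host:" line shows the target triple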

Maybe setting `RUSTFLAGS="-C target-cpu=native"` before building again in release mode will give a performance boost? Or maybe even building for an entirely different target?
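
As a one-off from the shell, the first option would look something like this (note that `target-cpu=native` lets LLVM use every feature of the CPU you are compiling on, so the resulting binary may not run on older machines):

RUSTFLAGS="-C target-cpu=native" cargo run --release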

So, main questions: What target are you compiling for, and what actual hardware are you running on?

Addendum: This question was asked rather long ago. Maybe the LLVM version shipped with Rust 1.50 just wasn't that good at optimizing for the Mac target, and things get better simply by updating to a current Rust version?

Rasmus Kaj
  • These are some good additions to the discussion in the comments, but to get the bounty I expect some more digging: either trying to recreate the issue (if you have a Mac at hand), comparing the provided assembly to find generated code that is indeed less efficient for the Mac, or finding an authoritative source that indicates that the old LLVM was worse at generating code for the Mac, ... – jthulhu Aug 24 '23 at 19:50