
I'm trying to understand why `for_each` performs much faster than a `for .. in` loop in my code, even though the official docs say the performance should be more or less the same: https://doc.rust-lang.org/book/ch13-04-performance.html

Here is my code; it sums all integers from 1 up to a given number:

fn main() {
    let arg1 = std::env::args().nth(1).expect("no arg given");
    let number = arg1.parse::<i128>().unwrap();

    println!("{number}");

    let mut result = 0;

    (1..=number).for_each(|n| result += n);

    println!("{result}")
}

When I replace the `for_each` with the following for loop, it's much slower for big numbers:

for n in 1..=number {
    result += n;
}

Timings are tested with `time cargo run -r 1000000000`.
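(For reference, timing inside the program avoids measuring cargo and process startup; a rough sketch using the standard library's `Instant`, with a smaller hard-coded input than the question's so it also finishes quickly without optimizations:)

```rust
use std::time::Instant;

fn main() {
    let number: i128 = 10_000_000;

    let start = Instant::now();
    let mut result = 0;
    (1..=number).for_each(|n| result += n);
    let elapsed = start.elapsed();

    println!("result = {result}, took {elapsed:?}");
}
```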

cebor
  • I don't expect any difference in release mode. How did you do the measurement? And you did use `--release`, right? – Sven Marnach Apr 24 '23 at 11:43
  • This is not a good benchmark. A good one at least repeats the measurement several times. Anything could disrupt a single measurement; for example, [dynamic frequency scaling](https://en.wikipedia.org/wiki/Dynamic_frequency_scaling) can speed up or slow down your CPU at random points in time. – JL0PD Apr 24 '23 at 11:50
  • I did some quick experiments, and it indeed looks like the iterator version gets replaced by `n * (n + 1) / 2`, while the for loop is not optimized in that way. – Sven Marnach Apr 24 '23 at 11:57
  • It's easier to optimise the `for_each` in the general case because side-effect analysis is limited to the captures of the closure. I actually thought the docs/book specifically mentioned that `for_each` is often faster... – Peter Hall Apr 24 '23 at 12:10
  • Looking at the assembly essentially confirms that the `for_each` code is constant time (though the formula used is `(n - 2) * (n - 3) / 2 + 2 * n - 3 + n` for some reason, which simplifies to the formula given above). I'm sure gcc will simplify the equivalent C code with a for loop to some O(1) code, but I haven't tried with clang. I'm surprised the compiler can't figure it out in this simple case, but you rarely get the guarantee that a certain optimization is applied. – Sven Marnach Apr 24 '23 at 12:15
  • Unfortunately, there's a known interaction issue between `RangeInclusive` (and indeed any `Chain` iterator) and LLVM: LLVM does not manage to split the loop between "part 1" and "part 2", which in turn prevents Scalar Evolution from coming up with a closed formula. Both `RangeInclusive` and `Chain` iterators provide a manually split loop in their `Iterator::fold` implementation which circumvents the issue, allowing Scalar Evolution to do its job and leading to the much better performance you observed for `Iterator::for_each` (which uses `fold` under the hood). – Matthieu M. Apr 24 '23 at 12:20
  • This is not a duplicate of the linked question. One of the *answers* happens to cover one element at play here, but the questions are distinct. – user3840170 Apr 25 '23 at 12:58
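Matthieu M.'s explanation can be checked directly: calling `Iterator::fold` on the inclusive range takes the same manually split code path that `for_each` uses, so a sum written via `fold` should be fast as well. A minimal sketch (same computation as the question, with the input hard-coded for brevity):

```rust
fn main() {
    let number: i128 = 10_000_000;

    // RangeInclusive's Iterator::fold contains a manually split loop,
    // which lets LLVM's Scalar Evolution derive a closed formula,
    // just as it does for for_each (which is built on fold).
    let result: i128 = (1..=number).fold(0, |acc, n| acc + n);

    println!("{result}"); // same value as number * (number + 1) / 2
}
```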

1 Answer


I simplified your example into a function to minimize the assembly output:

pub fn foo(number: i32) -> i32 {
    let mut result = 0;
    (1..=number).for_each(|n| result += n);
    result
}

versus

pub fn bar(number: i32) -> i32 {
    let mut result = 0;
    for n in 1..=number {
        result += n;
    }
    result
}

https://godbolt.org/z/4nEPGvhvP

(Your version used `i128`, but `i32` exhibits the same symptoms with much less assembly.)

  • We can see in the generated assembly that LLVM is able to optimize `foo` into a single multiplication (-> constant runtime): `n * (n + 1) / 2` (it actually uses a slightly more complicated version of this, presumably to avoid integer overflow: `(n - 2) * (n - 3) / 2 + 2 * n - 3 + n`)
  • `bar`, on the other hand, gets compiled to a straightforward loop (-> linear runtime)

It is therefore not surprising that you see a drastic performance difference.
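As a quick sanity check (my addition, not from the original answer), the more complicated expression seen in the assembly is indeed algebraically equal to Gauss's formula `n * (n + 1) / 2`:

```rust
fn main() {
    // Compare the formula seen in the assembly with Gauss's formula.
    // (n - 2) and (n - 3) are consecutive integers, so their product
    // is always even and the integer division by 2 is exact.
    for n in 1i64..=10_000 {
        let from_assembly = (n - 2) * (n - 3) / 2 + 2 * n - 3 + n;
        let gauss = n * (n + 1) / 2;
        assert_eq!(from_assembly, gauss);
    }
    println!("formulas agree for n = 1..=10000");
}
```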

This bad code generation for `bar` seems to be the result of a known issue regarding the optimization of loops over closed intervals in the current Rust/LLVM toolchain.

User @Matthieu M., who participated in the linked GitHub issue, explains more details about this in a related question over here.
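If you need the plain for loop to be fast today, one possible workaround (my suggestion, assuming the `RangeInclusive` issue described above is the culprit) is to iterate over an equivalent half-open range, whose simple loop shape LLVM can analyse; note that `number + 1` would overflow for `number == i32::MAX`:

```rust
pub fn baz(number: i32) -> i32 {
    let mut result = 0;
    // A half-open range has a straightforward trip count, so Scalar
    // Evolution can typically replace this loop with a closed formula.
    // Assumes number < i32::MAX, otherwise `number + 1` overflows.
    for n in 1..number + 1 {
        result += n;
    }
    result
}

fn main() {
    println!("{}", baz(100)); // prints 5050
}
```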

ChrisB