Why does iteration over an inclusive range generate longer assembly in Rust than in C++?

Question

These two loops are supposed to be equivalent in C++ and Rust:

#include <cstdint>

std::uint64_t sum1(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        sum += 1;
    }
    return sum;
}

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    for j in 0u64..=num {
        sum += 1;
    }
    return sum;
}

However the C++ version generates a very terse assembly:

sum1(unsigned long):                               # @sum1(unsigned long)
        xor     eax, eax
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

while Rust's version is very long with two checks in the loop instead of one:

example::sum1:
        xor     eax, eax
        xor     ecx, ecx
.LBB0_1:
        mov     rdx, rcx
        cmp     rcx, rdi
        adc     rcx, 0
        add     rax, 1
        cmp     rdx, rdi
        jae     .LBB0_3
        cmp     rcx, rdi
        jbe     .LBB0_1
.LBB0_3:
        ret

Godbolt: https://godbolt.org/z/xYW94qxjK

What is Rust intrinsically trying to prevent that C++ is carefree about?

I think it's something about inclusive ranges... I don't know the details, but I saw it mentioned recently. Try seeing what happens if you change the Rust loop to `for j in 0..num+1` — Herohtar, Jan 11 '22 at 19:47
@Herohtar Well that optimizes to a closed formula and no loop. Same happens with C++ as well. — , Jan 11 '22 at 19:51
@user3840170 Sorry for the noob question (c++ dev transitioning to Rust apologies) but is there a way to constrain a variable like with C++ `__builtin_assume( num<100);`? — , Jan 11 '22 at 19:59
@Jellyboy There's [`core::intrinsics::assume`](https://doc.rust-lang.org/stable/std/intrinsics/fn.assume.html), but it's perma-unstable. — Chayim Friedman, Jan 11 '22 at 20:00
@Jellyboy On stable, you can do `if num < 100 { unsafe { core::hint::unreachable_unchecked(); } }`. — Chayim Friedman, Jan 11 '22 at 20:07
@Jellyboy: For just seeing what a compiler will do with a loop, it's often sufficient to do `num &= 0xffff` or a right-shift or something; the extra AND or MOVZX instruction will be outside the loop. (And as we saw on a question a couple days ago about sum(1..n), some clang optimizer passes don't benefit from the value-range info that `__builtin_assume( num<100)` gives, but do work when there's actual guaranteed truncation as part of the program logic. — Peter Cordes, Jan 13 '22 at 01:33
FWIW, I have a crate I use for stating these kinds of assumptions succinctly, https://crates.io/crates/assume. E.g. `assume!(unsafe: num < 100)`. — GManNickG, Jan 13 '22 at 21:57

user3840170 · Accepted Answer · 2022-07-09T15:02:22.743

43

Overflow in the iterator state.

The C++ version will loop forever when given a large enough input:

#include <cstdint>

std::uint64_t sum1(std::uint64_t n) {  
    std::uint64_t sum = 0;
    for (std::uint64_t j = 0; j <= n; ++j) {
        __asm__ ("");
        sum += 1;
    }
    return sum;
}

#include <iostream>

int main() {
    std::cout << sum1(UINT64_C(0xffffffff'ffffffff)) << std::endl;

    return 0;
}

This is because when the loop counter j finally reaches 0xffffffff'ffffffff, incrementing it wraps around to 0, which means the loop invariant j <= n is always fulfilled and the loop never exits.

Strictly speaking, invoking the original version of sum1 with 0xffffffff'ffffffff infamously triggers undefined behaviour, though not because of overflow alone, but since infinite loops without externally-visible side effects are UB ([intro.progress]/1). This is why in my version I added an empty __asm__ statement to the function to act as a dummy ‘side effect’ preventing the compiler from taking ‘advantage’ of that in optimisation passes.

The Rust version, on the other hand, is not only perfectly well-defined, but iterates exactly as many times as the cardinality of the range:

use std::num::Wrapping;

fn sum1(num: u64) -> u64 {
    let mut sum = Wrapping(0);
    for _ in 0..=num {
        sum += Wrapping(1);
    }
    return sum.0;
}

fn main() {
    println!("{}", sum1(0xffffffff_ffffffff));
}

The above program (slightly modified to avoid getting bogged down in debug versus release mode differences with respect to the summation) will terminate after exactly 18 446 744 073 709 551 616 iterations and print 0; the Rust version carefully maintains iterator state so that overflow does not happen in the iterator.

edited Jul 09 '22 at 15:02

answered Jan 11 '22 at 20:09

user3840170

26,597
4
30
62

I used this version and it produces the same assembly as the one above in the question. https://godbolt.org/z/6xE49YMxd – Jan 11 '22 at 20:15
8

That's intentional. user3840170's changes were made to get a *debug* build to "work" (your version would cause a panic due to an overflow in `sum`). Release builds have overflow checks disabled (integer arithmetic wraps on overflow), so the `Wrapping` wrapper doesn't change the behavior there. – Francis Gagné Jan 11 '22 at 20:31
@FrancisGagné So I thought `-C opt-level=3` would be enough to get a release version in Godbolt. Is there anything else I'd need to pass? – Jan 11 '22 at 20:35
6

No, you're getting this backwards. You *are* getting a release build with `-C opt-level=3`. user3840170's changes are only important in a *debug* build (in which an overflow causes a panic, instead of just wrapping around; `Wrapping` replaces the panic with wrapping arithmetic). Your version has different behavior in debug vs. release mode when calling `sum1(0xffffffff_ffffffff)`, while user3840170's version has the same behavior in both modes, but they both behave the same in release mode (indeed, they generate the same assembly code, as you pointed out). – Francis Gagné Jan 11 '22 at 20:48
@Stargateur: A local variable being incremented doesn't count as a side-effect, as far as I know, as it's not observable (without further code). – Matthieu M. Jan 12 '22 at 08:36
1

@MatthieuM. that look like a strange rule of c++ it's look arbitrary and unclear – Stargateur Jan 12 '22 at 08:44
2

@MatthieuM. It was actually C++'s idea in the first place, developed as part of the addition of multithreading to C++11, and then bolted back into C11 with reservations -- compare C++11 [intro.multithread] paragraph 10 with the much weaker statement in C11 6.8.5p6. The previous versions of both standards (C++98, C99) don't contain any of this language. Anyway, the rationale appears to have been "allow compiler transformations such as removal of empty loops, even when termination cannot be proven" but I don't think that's an excuse for removal of loops that provably _don't_ terminate. – zwol Jan 12 '22 at 14:51
@zwol: Many thanks for the correction! And I'm still not sure I agree even with empty loops: I guess it's useful for loops containing only `assert`, but it's common enough in the embedded world to define `abort` as `while true {}` to be able to have the time to plug in a debugger and see where the code stopped... – Matthieu M. Jan 12 '22 at 15:21
Why doesn't the compiler do it this way with one check? `add rdx, -1` and `jc` or move the count into `rcx` and use `loop`. I thought the compiler would change counting up into down as it sees fit. – 93Iq2Gg2cZtLMO Jan 12 '22 at 21:33

score 16 · Answer 2 · edited Jan 12 '22 at 17:44

16

These two loops are equivalent in C++ and Rust

Your two code snippets don't share the same behavior. for (uint64_t j = 0; j <= n; ++j) doesn't handle n == uint64_t::MAX (make it infinite looping) while for j in 0u64..=num do (will never go into an infinite loop).

A rust equivalent code could be:

pub fn sum1(num: u64) -> u64 {
    let mut sum: u64 = 0;
    let mut j = 0;
    while j <= num {
        sum = sum.wrapping_add(1);
        j = j.wrapping_add(1);
    }
    sum
}

currently produce the following asm godbolt:

example::sum1:
        xor     eax, eax
.LBB0_1:
        add     rax, 1
        cmp     rax, rdi
        jbe     .LBB0_1
        ret

edited Jan 12 '22 at 17:44

idmean

14,540
9
54
83

answered Jan 11 '22 at 20:23

Stargateur

24,473
8
65
91

1

Elaborating a bit, this more direct translation does have the same output: https://rust.godbolt.org/z/GbToeKec7 – GManNickG Jan 11 '22 at 21:38
Shouldnt it be `saturating_add()` instead for `sum`? – Jan 12 '22 at 00:24
@Jellyboy No, unsigned arithmetic in C++ is modulo 2^k (in this case, k==64). Saturating in C++ would look like: `sum += sum == static_cast(-1) ? 0 : 1;`. – GManNickG Jan 12 '22 at 02:47

score 16 · Answer 3 · edited Apr 26 '23 at 05:49

16

@user3840170 correctly identified the difference: overflow check in Rust, and not in C++.

Still, the question remains as to why there are 2 checks per loop in the Rust version instead of 1, and the answer to that is that LLVM is not sufficiently smart and/or the RangeInclusive design is not well adapted to LLVM¹.

The optimal code generation for short loops, is to split the loop, transforming:

for j in 0u64..=num {
    sum += 1;
}

Into:

for j in 0u64..num { // equivalent to for (auto j = 0; j < num; ++j)
    sum += 1;
}

if 0 <= num {
    sum += 1;
}

This would lead to having a single branch in the main loop, and allow LLVM to switch this to a closed formula².

The Loop Splitting optimization applies to RangeInclusive and most other Chain iterators, as indeed a RangeInclusive can be thought of as a chain of a once iterator and half-exclusive range iterator (in either order). It is not always a win: like inlining, it implies duplicating the code of the loop body, which if large may lead to a significant overhead in code size.

Unfortunately, LLVM fails to split the loop. I am not sure if it's because the optimization is missing altogether, or it just fails to apply it here for some reason. I'm looking forward to the rustc_codegen_gcc backend, as GCC 7 added Loop Splitting to GCC, and it may be able to generate better code there.

¹ See this comment I left over performance issues with RangeInclusive. I spent significant time banging my head over the issue in 2019, and I dearly wish RangeInclusive (and all ranges) were NOT Iterator; it's a costly mistake.

² Chain iterators, in general, perform much better using .for_each(...), specifically because there the loop is (manually) split. See the playground for (0..=num).for_each(|_| sum += 1) being reduced to num + 1.

edited Apr 26 '23 at 05:49

user3840170

26,597
4
30
62

answered Jan 12 '22 at 10:49

Matthieu M.

287,565
48
449
722

AND I realize now if you split that loop, not only it gets optimized as per number of checks, the loop will get eliminated completely too, returning N+1. – Jan 12 '22 at 12:20
the code of rangeinclusif is so complex I still don't understand why they did like this, https://doc.rust-lang.org/src/core/iter/range.rs.html#913 call is empty but still do manual check, and code is unnecessary complex call next_spec and duplicate code everywhere and is_empty a critical function check `is_exhausted` first when 99.9% of case it's should just check the value. https://doc.rust-lang.org/src/core/ops/range.rs.html#539 I'm not at all surprise that LLVM can't optimize even this simple loop. – Stargateur Jan 12 '22 at 18:00
also I disagree about overflow check, it's not obligatory a range can be that anything you could use the Generic to do a lot of range and not would make sense to call this behavior a "overflow check" I also do not understand why you said it shouldn't be a iterator – Stargateur Jan 12 '22 at 18:04
What is `if 0 <= num` supposed to do? `num` being an unsigned type, this always evaluates to true? – Will Jan 12 '22 at 23:27
Another transformation that could work is `do{}while(j != num+1)`, where `num+1` is a wrapping add. A normal `for` loop compiles to a `do{}while()` loop [with a branch to skip it if the condition is false before the first iteration](https://stackoverflow.com/q/47783926/224132), but here the inclusive range means the trip-count is guaranteed non-zero so it actually takes *less* asm to implelement. I think the only overhead is incrementing `num`. (Or a copy of it, and tying up a register if the original `num` is needed later, assuming a compiler would rather save/restore another reg than `dec`) – Peter Cordes Jan 13 '22 at 01:49
1

@PeterCordes: While this works in this particular implementation, it doesn't in general as the `start` and `end` bounds can be "crossed": that is `3..=2` should iterate 0 times. (And with a general type, there's the issue that not all types support wrapping adds) – Matthieu M. Jan 13 '22 at 10:14
@MatthieuM. ah yes.... now I see why being an iterator is a problem.... damm. We can't change that ? – Stargateur Jan 13 '22 at 14:40
@Stargateur: Could maybe be changed as part of edition; but it may break quite a bit of code so nobody's dared until now (and the next edition is 2024). – Matthieu M. Jan 13 '22 at 14:44
@Will: Disappointment that Loop Splitting doesn't occur -- which would allow having a closed formula instead of SIMD. – Matthieu M. Jan 14 '22 at 12:01

Why does iteration over an inclusive range generate longer assembly in Rust than in C++?

3 Answers3

Linked

Related