I am trying to implement the standard `memmove` function in Rust, and I was wondering which method is faster for downward iteration (needed when the regions overlap with src < dest):

for i in (0..n).rev() {
    //Do copying
}

or

let mut i = n;
while i != 0 {
    i -= 1;
    // Do copying
}

Will the `rev()` in the `for` loop version significantly slow it down?

Shepmaster
EpicPotato
  • I have found that for some reason the `for` loop version of this code strangely fails with a Page Fault and a General Protection Fault (I am quite new to low-level development). For now I will be using the `while` loop version because of this. – EpicPotato Feb 21 '16 at 23:38
  • There is no reason for the for loop to fail if the while loop doesn't; the problem is likely elsewhere. – Matthieu M. Feb 22 '16 at 08:11

3 Answers

TL;DR: Use the for loop.


Both should be equally fast. We can check the compiler's ability to peel away the layers of abstraction involved in the for loop quite simply:

#[inline(never)]
fn blackhole() {}

#[inline(never)]
fn with_for(n: usize) {
    for i in (0..n).rev() { blackhole(); }
}

#[inline(never)]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        blackhole();
        i -= 1;
    }
}

This generates this LLVM IR:

; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN8with_for20h645c385965fcce1fhaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}

; Function Attrs: noinline nounwind readnone uwtable
define internal void @_ZN10with_while20hc09c3331764a9434yaaE(i64) unnamed_addr #0 {
entry-block:
  ret void
}

Even if you are not versed in LLVM, it is obvious that both functions compiled down to the same IR (and thus obviously to the same assembly).

Since their performance is the same, one should prefer the more explicit for loop and reserve the while loop to cases where the iteration is irregular.
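For the question's actual use case, the `for` loop form of a downward copy might look like the following sketch (my own illustration, not code from the question; `copy_backwards` is a hypothetical helper, and it assumes the caller guarantees both pointers are valid for `n` bytes):

```rust
// Sketch: a memmove-style backward copy using the `for` loop form.
// Copying downwards handles overlapping regions where src < dst.
unsafe fn copy_backwards(src: *const u8, dst: *mut u8, n: usize) {
    for i in (0..n).rev() {
        *dst.add(i) = *src.add(i);
    }
}

fn main() {
    let mut buf = [1u8, 2, 3, 4, 5, 0, 0];
    // Copy buf[0..5] into buf[2..7]: overlapping, with src < dst.
    unsafe {
        let p = buf.as_mut_ptr();
        copy_backwards(p, p.add(2), 5);
    }
    assert_eq!(buf, [1, 2, 1, 2, 3, 4, 5]);
}
```

A forward loop would clobber the not-yet-copied bytes in this overlapping case, which is exactly why the downward iteration is needed.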


EDIT: to address starblue's concern that the empty functions make the example unrepresentative.

extern crate libc;

// Declared as an external function so that LLVM cannot inline it.
#[link(name = "snappy")]
extern {
    fn blackhole(i: libc::c_int) -> libc::c_int;
}

#[inline(never)]
fn with_for(n: i32) {
    for i in (0..n).rev() { unsafe { blackhole(i as libc::c_int); } }
}

#[inline(never)]
fn with_while(n: i32) {
    let mut i = n;
    while i > 0 {
        unsafe { blackhole(i as libc::c_int); }
        i -= 1;
    }
}

compiles down to:

; Function Attrs: noinline nounwind uwtable
define internal void @_ZN8with_for20h7cf06f33e247fa35maaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %match_case.preheader, label %clean_ast_95_

match_case.preheader:                             ; preds = %entry-block
  br label %match_case

match_case:                                       ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

clean_ast_95_.loopexit:                           ; preds = %match_case
  br label %clean_ast_95_

clean_ast_95_:                                    ; preds = %clean_ast_95_.loopexit, %entry-block
  ret void
}

; Function Attrs: noinline nounwind uwtable
define internal void @_ZN10with_while20hee8edd624cfe9293IaaE(i32) unnamed_addr #1 {
entry-block:
  %1 = icmp sgt i32 %0, 0
  br i1 %1, label %while_body.preheader, label %while_exit

while_body.preheader:                             ; preds = %entry-block
  br label %while_body

while_exit.loopexit:                              ; preds = %while_body
  br label %while_exit

while_exit:                                       ; preds = %while_exit.loopexit, %entry-block
  ret void

while_body:                                       ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit
}

The core loops are:

; -- for loop
match_case:                                       ; preds = %match_case.preheader, %match_case
  %.in = phi i32 [ %2, %match_case ], [ %0, %match_case.preheader ]
  %2 = add i32 %.in, -1
  %3 = tail call i32 @blackhole(i32 %2)
  %4 = icmp sgt i32 %2, 0
  br i1 %4, label %match_case, label %clean_ast_95_.loopexit

; -- while loop
while_body:                                       ; preds = %while_body.preheader, %while_body
  %i.05 = phi i32 [ %3, %while_body ], [ %0, %while_body.preheader ]
  %2 = tail call i32 @blackhole(i32 %i.05)
  %3 = add nsw i32 %i.05, -1
  %4 = icmp sgt i32 %i.05, 1
  br i1 %4, label %while_body, label %while_exit.loopexit

And the only difference is that:

  • for decrements before calling blackhole, while decrements after
  • for compares against 0, while compares against 1

otherwise, it's the same core loop.

Matthieu M.
  • It's a bad example, because the compiler noticed that both functions do nothing. It would be more interesting with some nontrivial computation (maybe compute and return the sum). – starblue Feb 22 '16 at 09:12
  • @starblue: Actually, the compiler noticing that it does nothing is (IMHO) a perfect example => it demonstrates exactly what I wanted, that the compiler can peel away the layers of abstractions involved in the `(0..n)` iterator, its reversal, and the iteration over its reversed form in a `for` loop. The risk with abstractions (such as the `for` loop) is always that the compiler does not manage to optimize them away; this example proves that this is not the case here. – Matthieu M. Feb 22 '16 at 09:31
  • @starblue: Here you go, using a better blackhole that LLVM cannot inline. Of course, the results are the same since in order to optimize something out LLVM first had to reduce it to its bare components. – Matthieu M. Feb 22 '16 at 09:41
In short: They are (nearly) equally fast -- use the for loop!


Longer version:

First: rev() only works for iterators that implement DoubleEndedIterator, which provides a next_back() method. This method is expected to run in o(n) (sublinear time), usually even O(1) (constant time). And indeed, by looking at the implementation of next_back() for Range, we can see that it runs in constant time.
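This can be seen directly: `next_back()` on a range just decrements the stored upper bound, so the two ends can be consumed independently, and `rev()` merely swaps the roles of `next()` and `next_back()`. A small sketch using only the standard library:

```rust
fn main() {
    let mut it = 0..5;
    // next_back() yields from the upper end without touching the front.
    assert_eq!(it.next_back(), Some(4));
    assert_eq!(it.next(), Some(0));
    assert_eq!(it.next_back(), Some(3));

    // rev() simply swaps next() and next_back().
    let v: Vec<u32> = (0..5).rev().collect();
    assert_eq!(v, vec![4, 3, 2, 1, 0]);
}
```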

Now we know that both versions have asymptotically identical runtime. If this is the case, you should usually stop thinking about it and use the solution that is more idiomatic (which is for in this case). Thinking about optimization too early often decreases programming productivity, because performance matters only in a tiny percentage of all code you write.

But since you are implementing `memmove`, performance might really matter to you. So let's look at the resulting ASM. I used this code:

#![feature(start)]
#![feature(test)]

extern crate test;

#[inline(never)]
#[no_mangle]
fn with_for(n: usize) {
    for i in (0..n).rev() { 
        test::black_box(i); 
    }
}

#[inline(never)]
#[no_mangle]
fn with_while(n: usize) {
    let mut i = n;
    while i > 0 {
        test::black_box(i);
        i -= 1;
    }
}

#[start]
fn main(_: isize, vargs: *const *const u8) -> isize {
    let random_enough_value = unsafe {
        **vargs as usize
    };

    with_for(random_enough_value);
    with_while(random_enough_value);
    0
}

(Playground Link)

The `#[no_mangle]` improves readability in the resulting ASM. The `#[inline(never)]`, the `random_enough_value`, and the `black_box` prevent LLVM from optimizing away the things we want to inspect. The generated ASM of this (in release mode!) with some cleanup looks like:

with_for:                       |   with_while:
    testq   %rdi, %rdi          |       testq   %rdi, %rdi
    je  .LBB0_3                 |       je  .LBB1_3
    decq    %rdi                |   
    leaq    -8(%rsp), %rax      |       leaq    -8(%rsp), %rax
.LBB0_2:                        |   .LBB1_2:
    movq    %rdi, -8(%rsp)      |       movq    %rdi, -8(%rsp)
    decq    %rdi                |       decq    %rdi
    cmpq    $-1, %rdi           |       
    jne .LBB0_2                 |       jne .LBB1_2
.LBB0_3:                        |   .LBB1_3:
    retq                        |       retq

The only difference is that with_while has two fewer instructions, because it counts down to 0 instead of to -1 as with_for does.

Conclusion: if you can tell that the asymptotic runtime is optimal, you should probably not think about optimization at all. Modern optimizers are clever enough to compile high level constructs down to pretty perfect ASM. Often, data layout and resulting cache efficiency is much more important than a minimal count of instructions, anyway.

If you actually need to think about optimization though, look at the ASM (or LLVM IR). In this case the for loop is actually a bit slower (more instructions, comparison with -1 instead of 0). But the number of cases where a Rust programmer should care about this is probably minuscule.

Lukas Kalbertodt
For small N, it really shouldn't matter.

Rust is lazy with iterators; 0..n won't cause any evaluation until you actually ask for an element, and rev() asks for the last element first. As far as I know, the range iterator is clever and doesn't need to generate the first N-1 elements to get the Nth one. In this specific case, rev() probably costs nothing extra.
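The laziness is easy to observe (a small sketch of my own, using `inspect` to record which elements are ever evaluated):

```rust
fn main() {
    let mut evaluated = Vec::new();
    {
        // Building the iterator chain evaluates nothing.
        let mut rev = (0..1_000_000).inspect(|&x| evaluated.push(x)).rev();
        // Asking for the first element of the reversed iterator only
        // touches the back of the range, not the 999,999 elements before it.
        assert_eq!(rev.next(), Some(999_999));
    }
    assert_eq!(evaluated, vec![999_999]);
}
```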

In the general case, it depends on what kind of access paradigm and access time your iterator has; make sure that accessing the end takes constant time, and it doesn't make a difference.

As with all benchmarking questions, it depends. Test for your N values yourself!

Premature optimization is also evil, so if your N is small, and your loop isn't done very often... don't worry.

Shepmaster
Marcus Müller
  • 1
    Note that [`rev`](http://doc.rust-lang.org/std/iter/trait.Iterator.html#method.rev) is only available if the iterator implements [`DoubleEndedIterator`](http://doc.rust-lang.org/std/iter/trait.DoubleEndedIterator.html) and *generally* that is only implemented if access to the "next last" element is `O(1)`. – Shepmaster Feb 21 '16 at 23:20