
I tried to reproduce the example from this famous question. My code looks like this:

#![feature(test)]
extern crate rand;
extern crate test;

use test::Bencher;
use rand::{thread_rng, Rng};

type ItemType = u8;
type SumType = u64;
const TEST_SIZE: usize = 32_768;

#[bench]
fn bench_train(b: &mut Bencher) {
    let numbers = get_random_vec();
    b.iter(|| calc_sum(&numbers));
}

#[bench]
fn bench_train_sort(b: &mut Bencher) {
    let mut numbers = get_random_vec();
    numbers.sort();     // <-- the magic difference
    b.iter(|| calc_sum(&numbers));
}

fn get_random_vec() -> Vec<ItemType> {
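    // `gen_iter()` (the rand 0.3 API current at the time) yields an
    // endless stream of random values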
    thread_rng().gen_iter().take(TEST_SIZE).collect()
}

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    let mut sum = 0;
    for &num in numbers {
        if num < ItemType::max_value() / 2 {
            sum += num.into();
        }
    }

    sum
}

If I benchmark the exact code from above, I get reasonable results (similar to those in the linked question):

test bench_train      ... bench:     148,611 ns/iter (+/- 8,445)
test bench_train_sort ... bench:      21,064 ns/iter (+/- 1,980)

However, if I change SumType to u8, both versions run equally fast and much faster overall:

test bench_train      ... bench:       1,272 ns/iter (+/- 64)
test bench_train_sort ... bench:       1,280 ns/iter (+/- 170)

First off: of course the sum will overflow all the time, but in release mode Rust's overflow checks are disabled, so we just compute a wrong result without panicking (the addition simply wraps around). Could this be the reason for the surprisingly short time?
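
For illustration, a minimal standalone sketch (not part of the benchmark) of that wrapping behavior; wrapping_add makes the wrap explicit in any build mode:

fn main() {
    // In release mode, plain `+` on a u8 wraps modulo 256 instead of panicking.
    let sum: u8 = 250u8.wrapping_add(10);
    assert_eq!(sum, 4); // 260 mod 256 = 4
}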

Even stranger: when I change the implementation of calc_sum to something more idiomatic, the results change again. My second implementation:

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    numbers.iter()
        .filter(|&&num| num < ItemType::max_value() / 2)
        .fold(0, |acc, &num| acc + (num as SumType))
}

With this implementation, the SumType doesn't matter anymore: with u8 as well as with u64 I get these results:

test bench_train      ... bench:     144,411 ns/iter (+/- 12,533)
test bench_train_sort ... bench:      16,966 ns/iter (+/- 1,100)

So we again get the numbers we expect. The question is:

What is the reason for the strange running times?


PS: I tested with cargo bench, which compiles in release mode.

PPS: I just noticed that the first implementation of calc_sum uses into() for the cast, whereas the second one uses as. When I also use as in the first implementation, I get even stranger numbers (see the sketch after these results). With SumType = u64:

test bench_train      ... bench:      39,850 ns/iter (+/- 2,355)
test bench_train_sort ... bench:      39,344 ns/iter (+/- 2,581)

With SumType = u8:

test bench_train      ... bench:       1,184 ns/iter (+/- 339)
test bench_train_sort ... bench:       1,239 ns/iter (+/- 85)
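
For reference, the as variant of the first implementation differs only in the cast (a sketch; everything else is unchanged):

fn calc_sum(numbers: &Vec<ItemType>) -> SumType {
    let mut sum = 0;
    for &num in numbers {
        if num < ItemType::max_value() / 2 {
            sum += num as SumType;   // `as` instead of `.into()`
        }
    }

    sum
}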
  • Figuring this out would probably require looking at the machine code. You might find the Linux `perf` tool to be really useful. I may look at it later out of curiosity, but not right now. – Zan Lynx Jun 13 '16 at 23:01
  • @ZanLynx Sadly, I'm neither very good nor fast at reading machine code. I would appreciate more people looking at it :) – Lukas Kalbertodt Jun 14 '16 at 11:32

1 Answer


I took a quick look at the assembler code, and it appears that with SumType = u8, LLVM generates SSE2 instructions to do vector operations, which is much faster. Vectorized code is also branchless: every element goes through the same instructions regardless of its value, so branch prediction no longer plays a role and sorting makes no difference. In theory, LLVM should be able to optimize filter(...).fold(...) to the same code, but in practice it cannot always remove the overhead of abstraction. I hope that once MIR is added, Rust won't have to rely on LLVM for Rust-specific optimizations.
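
A hypothetical scalar equivalent of that branchless form (my own sketch, not the actual generated code) replaces the branch with a mask-style select:

fn calc_sum_branchless(numbers: &[u8]) -> u8 {
    let mut sum: u8 = 0;
    for &num in numbers {
        // The comparison yields 0 or 1; multiplying by it keeps or zeroes the
        // element, so every iteration executes the same instructions and there
        // is no data-dependent branch left to mispredict.
        sum = sum.wrapping_add(((num < u8::max_value() / 2) as u8) * num);
    }
    sum
}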

    This probably has nothing to do with closures and more to do with `Filter::next` being implemented with an internal secondary loop. LLVM probably can't see that it's equivalent to a mask over the array. – Veedrac Jun 14 '16 at 02:03
  • @Veedrac, you are probably right that it is `Filter::next`. – svat Jun 14 '16 at 02:17
  • If I understand your answer correctly, it doesn't explain why there isn't any difference between sorted and unsorted. Just being faster wouldn't completely remove the branch prediction effect, would it? – Lukas Kalbertodt Jun 14 '16 at 09:03
  • @lukas-kalberodt When the code is vectorized, it becomes branchless. Thus there is no difference. This fact is also mentioned in the "Update" section of the "famous question". So when I posted my answer here, I thought this fact was obvious. – svat Jul 27 '16 at 00:33
  • @svat I only saw your comment now, because of the wrong name spelling :o But thanks for that explanation! I'd appreciate it if you could add the information from your comment to your answer! It would also be awesome if you could briefly explain how vectorization removes branches. But I will accept it now anyway :) – Lukas Kalbertodt Sep 27 '16 at 11:24