Iterator-based code vs procedural code: how to make the iterator-based algorithm faster?

Question

I can see a huge difference in terms of performance, between an iterator-based algorithm (slow) and a procedural algorithm (fast). I want to improve the speed of the iterator-based algorithm but I don't know how to do it.

Context

I need to convert strings like 12345678 to 12_345_678. Questions like this have already been asked like in this question. (Note: I don't want to use a crate for this, because I have some custom needs about it).

Edit: it is assumed that the input string is ASCII.

I wrote two versions of the algorithm:

an iterator-based version, but I'm not satisfied because there is two .collect(), which means many allocations (at least one .collect() is extra, without being able to remove it, see the code below);
a procedural version of it.

Then I compared the execution time of the two versions: my iterator version is about ~8x slower compared to the procedural version.

Benchmark code

use itertools::Itertools; // Using itertools v0.10.0

/// Size of the string blocks to separate:
const BLOCK_SIZE: usize = 3;

/// Procedural version.
fn procedural(input: &str) -> String {
    let len = input.len();
    let nb_complete_blocks = (len as isize - 1).max(0) as usize / BLOCK_SIZE;
    let first_block_len = len - nb_complete_blocks * BLOCK_SIZE;

    let capacity = first_block_len + nb_complete_blocks * (BLOCK_SIZE + 1);
    let mut output = String::with_capacity(capacity);

    output.push_str(&input[..first_block_len]);

    for i in 0..nb_complete_blocks {
        output.push('_');
        let start = first_block_len + i * BLOCK_SIZE;
        output.push_str(&input[start..start + BLOCK_SIZE]);
    }

    output
}

/// Iterator version.
fn with_iterators(input: &str) -> String {
    input.chars()
        .rev()
        .chunks(BLOCK_SIZE)
        .into_iter()
        .map(|c| c.collect::<String>())
        .join("_")
        .chars()
        .rev()
        .collect::<String>()
}

fn main() {
    let input = "12345678";

    macro_rules! bench {
        ( $version:ident ) => {{
            let now = std::time::Instant::now();
            for _ in 0..1_000_000 {
                $version(input);
            }
            println!("{:.2?}", now.elapsed());
        }};
    }

    print!("Procedural benchmark: ");
    bench!(procedural);

    print!("Iterator benchmark: ");
    bench!(with_iterators);
}

Typical benchmark result

Procedural benchmark: 17.07ms
Iterator benchmark: 240.19ms

Question

How can I improve the iterator-based version, in order to reach the performance of the procedural version ?

Thanks for updating the figures! Even though in this case there's still a big difference, running with optimizations turned on is important for accurate benchmarks. — trent, Jun 11 '21 at 13:08

sebpuetz · Answer 1 · 2021-06-12T00:05:57.213

3

The with_iterators version allocates a new String for every chunk in the input whereas the procedural version just slices the input and appends it to the output as necessary. Afterwards, these Strings are joined into a reversed version of the target String which again has to be reversed into yet another String, including determining its char offsets and collecting into another String. This is a lot of additional work and would explain the massive slowdown.

You can do something very similar which will even be more robust than the procedural version through chars and for_each:

/// Iterator version.
fn with_iterators(input: &str) -> String {
    let n_chars = input.chars().count();
    let capacity = n_chars + n_chars / BLOCK_SIZE;
    let mut acc = String::with_capacity(capacity);
    input
        .chars()
        .enumerate()
        .for_each(|(idx, c)| {
            if idx != 0 && (n_chars - idx) % BLOCK_SIZE == 0 {
                acc.push('_');
            } 
            acc.push(c);
        });
    acc
}

Just slicing into a &str is prone to panics if you can't rule out that multi-byte encoded characters are part of the input. I.e., procedural assumes a 1-to-1 mapping between u8 and char which is not given.

chars returns each character, with enumerate() you can track its offset in terms of other characters into the str and determine when to push the '_'.

The proposed version and the procedural version both run between 10-20ms on my machine.

edited Jun 12 '21 at 00:05

answered Jun 09 '21 at 22:08

sebpuetz

2,430
1
7
15

1

This is far, far better thank you! Sorry, I forgot to say that `[u8]` was assumed, even if I prefer your logic. However if you consider utf-8 strings, there is a small error in your proposal: `(input.len() - idx)` should be `(nb_utf8_chars - idx)`, which could be calculated at the top of the function, with the right libs – yolenoyer Jun 09 '21 at 22:46
1

Another slight improvement would be to calculate the input length at the top of the function, eg: `let len = input.len();` – yolenoyer Jun 09 '21 at 22:57
Good points, I'll edit my answer tomorrow! – sebpuetz Jun 09 '21 at 23:00
This would look very similar with just a simple `for` loop instead of using `.fold()`, and it might be a shade faster. Seems like we're trading "elegance" for performance. – Todd Jun 11 '21 at 21:05
Do you have a bench to support your claim that external iteration beats the internal iteration as by `fold()`? – sebpuetz Jun 11 '21 at 23:13
Yes. I'll include it in my post. – Todd Jun 11 '21 at 23:15
Interesting, `fold()` seems to slow things down quite a bit, with `for_each()` it's on par with the explicit for loop in some criterion bench. – sebpuetz Jun 12 '21 at 00:04
The `.for_each()` implementation is actually a little faster than your same function with a `for` loop on my system. – Todd Jun 12 '21 at 02:19
BTW, `.len()` and `.with_capacity()` respectively return and take the number of bytes. If we're assuming the strings are only ascii compatible utf-8, we can treat them as either char or byte for things like getting the char count - doesn't really matter. – Todd Jun 12 '21 at 07:07

Todd · Answer 2 · 2021-06-12T20:31:20.237

This is also very slow. But here's a slightly different way for what it's worth.

fn iterators_todd(input: &str) -> String {
    let len     = input.len();
    let n_first = len % BLOCK_SIZE;
    let it      = input.chars();
    let mut out = it.clone().take(n_first).collect::<String>();

    if len > BLOCK_SIZE && n_first != 0 {
        out.push('_'); 
    }
    out.push_str(&it.skip(n_first)
                    .chunks(BLOCK_SIZE)
                    .into_iter()
                    .map(|c| c.collect::<String>())
                    .join("_"));
    out
}

I wouldn't blame the slowness entirely on .collect(). The chunking iterator may also be slowing it down. Generally, chaining together a number of iterators, then iteratively pulling data through the chain is just not going to be as efficient as approaches with fewer iterators chained together.

The code above is roughly an O(N) algorithm (but a slow one), at least judging from visual appearances of the code above without digging into the implementations of the iterators.

I used the timeit crate to time the performance of each solution. This Gist has the code that produced the results below.

iterators_yolenoyer - yolenoyer's iterator approach.
procedural_yolenoyer - yolenoyer's "procedural" function.
procedural_yolenoyer_modified - the previous with some changes.
iterators_sebpuetz - sebpuetz' iterator example.
procedural_sebpuetz - the above using a for loop instead of .for_each().
iterators_todd - the example in this answer.

Output:

          iterators_yolenoyer benchmark:    701.427605 ns
         procedural_yolenoyer benchmark:     62.651766 ns
procedural_yolenoyer_modified benchmark:     59.283306 ns
           iterators_sebpuetz benchmark:     94.315160 ns
          procedural_sebpuetz benchmark:    121.447247 ns
               iterators_todd benchmark:    606.468828 ns

It's interesting to note that using .for_each() is faster than a simple for loop (compare iterators_sebpuetz, which uses .for_each() to the same algorithm using a for loop instead, procedural_sebpuetz). The for loop being slower may not be generally the case for all use cases.

Thank you, it's better but the procedural version is still really faster! I guess this is due to the `collect::()` calls — yolenoyer, Jun 11 '21 at 12:55
Replacing the `collect()`s with `fold()` and setting the correct size shouldn't be improving the perfomance much at all. You're still allocating a `String` for each chunk. — sebpuetz, Jun 11 '21 at 23:11
Correct @sebpuetz. It shouldn't improve it, but it actually does to a tiny degree - but it's not at all significant. — Todd, Jun 11 '21 at 23:14

Iterator-based code vs procedural code: how to make the iterator-based algorithm faster?

Context

Benchmark code

Typical benchmark result

Question

2 Answers2