2

I would like to build an iterator that returns a number of elements, each element computed by a different block of code. Each block of code may or may not return an element. I'm looking for the best way to represent an iterator like this.

Minimized examples follow. Though the blocks return constants, in the real code, whether each block returns an item is based on context. Also, in the real code, there are many blocks of code (not just three), millions/billions of calls are made to this iterator, and large trees are constructed, so both space and time complexity are important.

One attempt:

pub fn it1() -> impl Iterator<Item = usize> {
    let mut state = 0;
    std::iter::from_fn(move || {
        loop {
            state += 1;
            match state {
                1 => (),             // Code A, does not emit item
                2 => break Some(3),  // Code B, emits item
                3 => break Some(45), // Code C, emits item
                _ => break None,
            }
        }
    })
}

This seems efficient since non-values don't take any resources, but a variable is needed to track which computation is taking place which seems prone to error.

Another:

pub fn it2() -> impl Iterator<Item = usize> {
    [
        { 
            None // Code A runs, does not emit item
        },
        {
            Some(3)// Code B runs, emits item
        },
        {        
            Some(45) // Code C runs, emits item
        },
    ]
    .into_iter()
    .filter_map(|x| x)
}

This does not need the state variable but needs an array. It also needs to keep the non-values and then do another pass to remove them, which is inefficient (?) with large numbers of items.

Third:

pub fn it3() -> impl Iterator<Item = usize> {
    std::iter::empty()
        .chain(std::iter::once_with(|| {
            // Code A runs, does not emit item
            None
        }))
        .chain(std::iter::once_with(|| {
            // Code B runs, emits item
            Some(3)
        }))
        .chain(std::iter::once_with(|| {
            // Code C runs, emits item
            Some(45)
        }))
        .filter_map(|x| x)
}

This does not need the array but incurs function call overhead (?) and still has the second pass to filter out non-values. Also possible many iterator chain calls, once_with calls, etc. incur unnecessary overhead.

Are there established patterns on how to build this? Obvious, preferred/idiomatic or best practices approaches?

Shepmaster
  • 388,571
  • 95
  • 1,107
  • 1,366
puritii
  • 1,069
  • 1
  • 7
  • 18
  • 2
    *a variable is needed to track which computation is taking place* — see the "nightly Rust" section of [Lazy sequence generation in Rust](https://stackoverflow.com/a/30279122/155423) – Shepmaster Dec 20 '21 at 18:40
  • 1
    `.filter_map(|x| x)` -> `.flatten()`. – Shepmaster Dec 20 '21 at 18:41
  • A state machine that implements Iterator would do nice I think. – Netwave Dec 20 '21 at 18:42
  • 2
    See also [`itertools::chain!`](https://docs.rs/itertools/latest/itertools/macro.chain.html). – Shepmaster Dec 20 '21 at 18:42
  • 3
    *"This incurs function call overhead (?)"* - it could, but all these iterator building blocks are generic and closures are distinct types so this will all be monomorphized together and has a good chance that they are inlined. *"still has the second pass to filter out non-values"* - remember, Rust iterators are lazy so this isn't a "second pass" but basically just an `if` when yielding a value, which you would need somewhere anyway. – kmdreko Dec 20 '21 at 18:48
  • If you [check the assembly for summing the three versions in release mode](https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=b67dfee3b9c5cea2398bd4159cc4acf3), `it2` and `it3` are optimized to a constant — the optimizer sees through everything. – Shepmaster Dec 20 '21 at 18:55
  • Small note: in the first example, `state` could potentially overflow since I think by default integer literals are `i32`. – Coder-256 Dec 20 '21 at 19:07
  • @Coder-256 I don't see the path that would cause an overflow — can you share? Any state that isn't {1,2,3} will cause the iterator to terminate. – Shepmaster Dec 20 '21 at 19:10
  • @Shepmaster You can still keep repeatedly calling `.next()` on an iterator after it terminates, unless you use `.fuse()`. For various reasons, in my experience this does happen in the real world. – Coder-256 Dec 20 '21 at 19:16
  • @Coder-256 ah, I wasn't thinking of the `FusedIterator` case. It [looks like `it2` and `it3` are fused](https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=46058af0a0d27e3edc046bf722a9b702); I wonder if that explains the optimization differences at all. However, based on that, it could overflow regardless of the size of `state` :-) You'd have to switch to an enum. – Shepmaster Dec 20 '21 at 19:22
  • 1
    It hard to answer about performance without the full code. In small examples, all approaches will equally perform, but in bigger code the compiler may not inline functions and that can make a very noticable difference. – Chayim Friedman Dec 20 '21 at 20:07
  • That being said, I think the most performant approach will be the hand-written state machine (it is very similar to what the compiler generates for generators). The array approach and `chain()` are not the same, since the first will evaluate all elements before iterating. If that doesn't matter, I think the `chain()` is better, assuming you don't use a `for` loop but internal iteration (`for_each()`, `fold()`, etc.). – Chayim Friedman Dec 20 '21 at 20:10
  • 1
    @ChayimFriedman you can change the array version to be an [array of closures](https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=97670ad4c0c009407205fa0e1294d9ee) to regain the laziness. – Shepmaster Dec 20 '21 at 20:14

1 Answers1

1

I personally use an approach similar to your first suggestion:

pub fn it1() -> impl Iterator<Item = usize> {
    (0..3).filter_map(move |i| match i {
        0 => None,      // Code A, does not emit item
        1 => Some(3),   // Code B, emits item
        2 => Some(45),  // Code C, emits item
        _ => None,
    })
}

I don't think this has any more overhead than necessary to solve the problem, and it's also relatively concise and readable. This probably isn't an "established pattern", though – I can't remember seeing this anywhere else.

Sven Marnach
  • 574,206
  • 118
  • 941
  • 841
  • This form [optimizes the same as `it2` and `it3`](https://play.rust-lang.org/?version=stable&mode=release&edition=2021&gist=42a6eebe92dac01e032942d9809f7c94). – Shepmaster Dec 20 '21 at 20:04
  • Suggest using `unreachable!()` for the default match arm, to more clearly communicate that it's... well, unreachable. – user4815162342 Dec 20 '21 at 21:28