I'm using the select library to parse an HTML table into a stream of Data structures.

Ideally I would like to write a function that downloads the HTML, parses it and returns an iterator. Something like this:

fn get_data_iterator(...) -> impl Iterator<Item = Data> {
    let doc = Document::from_read(...).unwrap();
    doc.find(Name("tr")).map(tr_to_data)
}

However, doc.find() returns a Find<'a, P>, which borrows from doc and therefore cannot outlive it.

Is there a way to package doc with the returned iterator so that it lives as long as the iterator?

I tried writing a proxy iterator struct that would contain both doc and the iterator created with doc.find, but I couldn't find a way to do that correctly.

Fixpoint
  • You might have luck with [`owning_ref`](https://docs.rs/owning_ref/0.4.1/owning_ref/), though I won't make any promises. Otherwise there's no real option aside from `unsafe`, and probably `Box`-ing the document: `Find` seems to store a `&Document`, so the `Document` itself would need to be pinned. – Masklinn Jun 09 '21 at 10:13
  • This is one of the situations in Rust where you simply have to use unsafe if you want the kind of interface you desire. The unsafe is only used in the _implementation_, though: if you do it right, the end result will be a safe public API that is _sound_, i.e. cannot be misused. If you are not averse to unsafe, see [this answer](https://stackoverflow.com/a/67828823/1600898) for an example of what you need to do. There are also crates like `owning_ref`, but they seem to prefer dealing with raw references. – user4815162342 Jun 10 '21 at 11:43

1 Answer

If you control the interface, you can pass the Document as an argument to get_data_iterator. The lifetime of the returned impl Iterator<Item = Data> can then be tied to the reference you pass into the function, i.e.:

// lifetimes could be elided, annotation for demonstration purposes
fn get_data_iterator<'a>(doc: &'a Document, ...) -> impl Iterator<Item=Data> + 'a {
    doc.find(Name("tr")).map(tr_to_data)
}
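
To see the borrow relationship without the select crate, here is a minimal sketch that uses only the standard library. The `Document` struct, its `rows` field, the `Data` struct, and this `tr_to_data` are hypothetical stand-ins, not the types from the question; the point is only that the caller owns the document and the iterator borrows from it.

```rust
// Hypothetical stand-in for select's `Document`: it simply owns its rows.
struct Document {
    rows: Vec<String>,
}

// Hypothetical record type; the question does not show the real `Data`.
#[derive(Debug, PartialEq)]
struct Data {
    cell: String,
}

// Hypothetical conversion; the question does not show the real `tr_to_data`.
fn tr_to_data(row: &str) -> Data {
    Data { cell: row.trim().to_string() }
}

// The returned iterator borrows from `doc`, so it cannot outlive it.
fn get_data_iterator<'a>(doc: &'a Document) -> impl Iterator<Item = Data> + 'a {
    doc.rows.iter().map(|row| tr_to_data(row))
}

// The caller keeps the document alive for as long as it iterates.
fn collect_cells(doc: &Document) -> Vec<String> {
    get_data_iterator(doc).map(|d| d.cell).collect()
}
```

The cost of this design is that document creation moves to the call site: the caller first builds the `Document`, keeps it in a local variable, and then calls `get_data_iterator(&doc)`.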
sebpuetz
  • I control all the code, however I would like it to be a self-contained function that downloads the document, parses it and returns an iterator. – Fixpoint Jun 09 '21 at 11:00
  • The problem is that the document is dropped at the end of the function, so any reference to it returned from that function would be invalid. [This post](https://stackoverflow.com/questions/32300132/why-cant-i-store-a-value-and-a-reference-to-that-value-in-the-same-struct) has an extensive description of why self-referential structs are not supported. – sebpuetz Jun 09 '21 at 11:09
  • Thanks for the link. I understand (better now) the issues with self-referential structs. However I'd like to find out if I'm maybe missing some other convenient solution. – Fixpoint Jun 09 '21 at 13:29
  • It's always an option to collect the elements and return an owned `Vec` from your function if that improves ergonomics. You can also define an extension trait for `Document` so that you can call `.data_iter` on a `Document` directly: [Playground](https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=7399df05059a556cc8097a397d66d4e0). Note that this excludes the `map()` since I don't know where `tr_to_data` comes from. – sebpuetz Jun 09 '21 at 14:05