
I'm using Kuchiki to parse some HTML and hyper to make HTTP requests, operating on the results concurrently through scoped_threadpool.

I select and iterate over listings. I decide the number of threads to allocate in the threadpool based on the number of listings:

let listings = document.select("table.listings").unwrap();
let mut pool = Pool::new(listings.count() as u32);
pool.scoped(|scope| {
    for listing in listings {
        do_stuff_with(listing);
    }
});

When I try to do this I get the error capture of moved value: listings. listings is a kuchiki::iter::Select<kuchiki::iter::Elements<kuchiki::iter::Descendants>>, which is non-copyable and non-cloneable -- so I get neither an implicit copy nor an explicit .clone().

Inside the pool I can just do document.select("table.listings") again and it will work, but this seems unnecessary to me since I already used it to get the count. I don't need listings after the loop either.
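That workaround would look something like this:

let count = document.select("table.listings").unwrap().count();
let mut pool = Pool::new(count as u32);
pool.scoped(|scope| {
    // running the same query again yields a fresh iterator
    for listing in document.select("table.listings").unwrap() {
        do_stuff_with(listing);
    }
});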

Is there any way for me to use a non-copyable value in a closure?

Explosion Pills

1 Answer


Sadly, I don't think it's possible the way you want to do it.

Your listings.count() call consumes the iterator listings. You could avoid the move by writing listings.by_ref().count(), but that won't have the desired effect either: count() still consumes every element of the iterator, so the next call to next() will always yield None.
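To illustrate with a plain standard-library iterator (nothing kuchiki-specific), here is a minimal sketch of what happens:

let mut iter = vec![1, 2, 3].into_iter();
let n = iter.by_ref().count();
assert_eq!(n, 3);              // by_ref() avoids moving `iter` out...
assert_eq!(iter.next(), None); // ...but count() still drained every element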

The only way to achieve your goal is to somehow get the length of the iterator listings without consuming its elements. The trait ExactSizeIterator was built for this purpose, but it seems that kuchiki::iter::Select doesn't implement it. Note that this may also be impossible for that kind of iterator.
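For contrast, standard iterators that do implement ExactSizeIterator can report their length without consuming anything:

let iter = vec![1, 2, 3].into_iter();
assert_eq!(iter.len(), 3);  // len() comes from ExactSizeIterator, consumes nothing
let sum: i32 = iter.sum();  // the elements are all still there
assert_eq!(sum, 6);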

Edit: As @delnan suggested, another possibility is of course to collect the iterator into a Vec. This has some disadvantages, but may be a good idea in your case.
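A sketch of the collect approach, reusing the names from your snippet:

let listings: Vec<_> = document.select("table.listings").unwrap().collect();
let mut pool = Pool::new(listings.len() as u32);
pool.scoped(|scope| {
    // iterating the Vec by value moves each element into the loop,
    // just like the original iterator would have done
    for listing in listings {
        do_stuff_with(listing);
    }
});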


Let me also note that you probably shouldn't create one thread for every row in the SELECT result set. Usually thread pools use approximately as many threads as there are CPUs.
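For example, sizing the pool by the available hardware could look like this (just a sketch using the num_cpus crate; whether it helps your workload needs measuring):

// with `extern crate num_cpus;` declared at the crate root
let mut pool = Pool::new(num_cpus::get() as u32);
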

Lukas Kalbertodt
  • I tried specifically using num_cpus and making the thread pool that size but it was _much_ slower. I'm not sure how to determine the optimal thread pool size in this case. When I don't use a thread pool and just spawn a thread in the loop it works okay (and is fast) too. Any other suggestions for that? – Explosion Pills Feb 27 '16 at 17:15
  • I don't have any good suggestions, except for: measure, measure, measure. For example: are all your CPU cores under full load when using `num_cpus`? You can also use twice the number of CPUs. I don't know what algorithm `scoped_threadpool` uses: maybe the work for each thread is so little that the scheduling takes most of the time... --> I don't know, sorry ;) – Lukas Kalbertodt Feb 27 '16 at 17:21
  • Is `do_stuff_with` I/O-heavy? Then it makes sense that N threads can't saturate N CPUs, and if you have a beefy I/O connection they probably can't saturate that either. However, if you might end up with thousands of `listing`s it'll probably still degrade performance to launch *that many* threads, so I'd suggest experimenting with a (possibly large) multiple of the CPU count. –  Feb 27 '16 at 17:25
  • @LukasKalbertodt: Whether an exact count can be determined is really dependent on the implementation. A no-allocation streaming implementation would not be able to predict the count, as to get it requires fully parsing the document and storing all elements. – Matthieu M. Feb 27 '16 at 17:27
  • Oh and another possibility would be to `.collect()` the query result in a vector. This decreases the parallelism a bit, but memory-wise you already risk having all results in memory at once (if you keep launching one thread per listing), so that's not a concern here. –  Feb 27 '16 at 17:28
  • @MatthieuM. Oh sure, you're right. I didn't even read what `kuchiki` is about and assumed it's an SQL SELECT >_< (which sometimes can tell the size of the result set) – Lukas Kalbertodt Feb 27 '16 at 17:34
  • @LukasKalbertodt: In any case your advice is sound: the size of the thread pool should depend on the available hardware; not spawning one thread per computation is the whole point of using a pool. – Matthieu M. Feb 27 '16 at 17:53