I'm iterating over several gigabytes of input items from a database. On each input item, I'm doing some CPU-intensive processing which produces one or more new output items, tens of gigabytes in total. The output items are then stored in another database table.
I have gotten a nice speedup by using Rayon for parallel processing. However, the database API is not thread-safe; it's `Send` but not `Sync`, so the I/O must be serialized.
Ideally, I would just want to write:
```rust
input_database
    .read_items()
    .par_bridge() // Start parallelism.
    .flat_map_iter(|input_item| {
        // Produce an Iterator<Item = OutputItem>.
    })
    .ser_bridge() // End parallelism. This function does not exist.
    .for_each(|output_item| {
        output_database.write_item(output_item);
    });
```
Basically, I want the opposite of `par_bridge()`: something that runs on the thread where it's called, reads items from each worker thread, and yields them serially. But this doesn't seem to exist in the current implementation of Rayon. I'm not sure whether that's because it's theoretically impossible, or because it simply doesn't fit the library's current design.
The output is too big to collect into a `Vec` first; it needs to be streamed into the database directly.
By the way, I'm not married to Rayon; if there's another crate that is more suitable, I'm happy to make the switch.