I've read the Dask documentation, blogs and SO, but I'm still not 100% clear on how to set this up. My use case:
- I have about 10 GB of reference data. Once loaded, it is read-only. We usually load it into Dask/Pandas dataframes.
- I need this ref-data to process (enrich, modify, transform) about 500 million events a day (spread over multiple files).
- The "process" is a pipeline of about 40 tasks. The execution sequence matters (dependencies); a sketch follows after this list.
- Each individual task is not complicated or time-consuming: mostly lookups, enrichments, mappings, etc.
- There are no dependencies between events. In theory I could process every event in a separate thread, combine the outputs into a single file, and be done. The output events don't even need to be in the same order as the input events.
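To make the pipeline concrete, here is roughly what processing one batch of events looks like. The task names, column names and the lookup logic are placeholders I made up for illustration; the real pipeline has about 40 such steps:

```python
import pandas as pd

# Placeholder task functions -- each one is a cheap lookup / enrichment /
# mapping against the ref-data.
def enrich_with_lookup(events: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    return events.merge(ref[["key", "attribute"]], on="key", how="left")

def map_codes(events: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    events["code"] = events["code"].map(ref.set_index("code")["mapped_code"])
    return events

# Order matters; in reality there are ~40 of these.
PIPELINE = [enrich_with_lookup, map_codes]

def process_events(events: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    for task in PIPELINE:
        events = task(events, ref)
    return events
```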
In summary:
- We can massively parallelize event processing (see the naive sketch after this list).
- Every parallel thread needs the same 10 GB of (raw) ref-data
- Processing a single event means applying a sequence/pipeline of ~40 tasks to it.
- Each individual task is not time-consuming (read the ref-data and modify the event).
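For reference, the naive way I understand this in Dask terms looks roughly like the following; my suspicion is that passing the 10 GB ref frame directly into `map_partitions` bakes it into the task graph and causes exactly the serialization overhead listed under pitfalls. Paths are placeholders and `PIPELINE` is the list of task functions from the sketch above:

```python
import dask.dataframe as dd
import pandas as pd

ref = pd.read_parquet("/shared/ref_data.parquet")   # ~10 GB, read-only

def process_partition(events: pd.DataFrame, ref: pd.DataFrame) -> pd.DataFrame:
    for task in PIPELINE:            # the ~40 task functions from above
        events = task(events, ref)
    return events

events = dd.read_parquet("events/2023-01-01/")       # one day, multiple files
result = events.map_partitions(process_partition, ref)
result.to_parquet("output/2023-01-01/")
```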
Possible pitfalls / issues:
- Spending more time on serialization/deserialization than on processing the data (we experienced this in some of our trials, which used pipe-like approaches).
- The ref-data get loaded multiple times, once by each (parallel) process.
- Preferably I would like to dev/test it on my laptop, but I don't have enough memory to load the ref-data. Maybe the solution could leverage memory maps? (A sketch of what I have in mind follows below.)
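Regarding memory maps: if the lookups can be expressed against plain arrays (an assumption on my side), something like numpy's `mmap_mode` might let the OS page the ref-data in on demand instead of holding all 10 GB in RAM on the laptop:

```python
import numpy as np
import pandas as pd

# One-off preparation step: dump a numeric lookup column to a .npy file
# ("ref_lookup.npy" is a placeholder name).
# np.save("ref_lookup.npy", ref["attribute"].to_numpy())

# The laptop then maps the file instead of loading it; only the pages that
# are actually touched end up in memory.
lookup = np.load("ref_lookup.npy", mmap_mode="r")

def enrich(events: pd.DataFrame) -> pd.DataFrame:
    # Assumes ref_id is a positional index into the lookup array.
    events["attribute"] = lookup[events["ref_id"].to_numpy()]
    return events
```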
The most efficient solution seems to be to load the ref-data into memory only once and make it available read-only to the multiple other processes that process the events.
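In Dask terms, my understanding is that this would mean loading the ref-data once on the cluster and passing a handle (future) into the tasks, so that only a small reference travels with each task instead of the data itself. A sketch of what I mean, assuming dask.distributed (scheduler address and path are placeholders):

```python
from dask.distributed import Client
import pandas as pd

client = Client("tcp://scheduler:8786")   # assumption: an existing cluster

def load_ref() -> pd.DataFrame:
    return pd.read_parquet("/shared/ref_data.parquet")   # placeholder path

# Load the ref-data once on the cluster; the future is just a handle,
# so tasks that depend on it don't re-serialize the 10 GB per task.
ref_future = client.submit(load_ref)
client.replicate([ref_future])            # one copy per worker, kept in memory
```

If that is indeed how futures behave (one transfer per worker, not per task), it would address both the serialization pitfall and the repeated loading.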
Scaling out to multiple computers would then mean loading the ref-data on each computer and pushing only filenames to the computers for execution.
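For that scale-out part, I imagine the client shipping only filenames, with each worker combining its in-memory copy of the ref-data with the files it is told to process. Continuing from the previous sketch (`process_file`, `PIPELINE` and `event_files` are placeholders, `ref_future` and `client` are the handles from above):

```python
def process_file(path: str, ref: pd.DataFrame) -> str:
    events = pd.read_parquet(path)           # placeholder: events arrive as parquet
    for task in PIPELINE:                    # the ~40 task functions from above
        events = task(events, ref)
    events.to_parquet(path + ".processed")   # output order doesn't matter
    return path

# Only the filename and a reference to the ref-data travel with each task.
futures = client.map(process_file, event_files, ref=ref_future)
client.gather(futures)
```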
Any idea how to achieve this?
Thanks a lot for your help