Is it possible in Python (maybe using dask, maybe using multiprocessing) to 'emplace' generators on cores, and then, in parallel, step through the generators and process the results?
It needs to be generators in particular (or objects with __iter__); lists of all the elements the generators yield won't fit into memory.
In particular:
With pandas, I can call read_csv(..., iterator=True), which gives me an iterator (TextFileReader). I can loop over it with for or explicitly call next on it multiple times; the entire CSV never gets read into memory. Nice.
Every time I read the next chunk from the iterator, I also perform some expensive computation on it.
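For concreteness, the single-file version looks roughly like this (the file name, chunksize, and the body of expensive_process are just placeholders):

```python
import pandas as pd

def expensive_process(chunk):
    # stand-in for the real per-chunk computation
    return chunk.sum()

# the single-file case: the whole CSV is never in memory at once
reader = pd.read_csv("file1.csv", iterator=True, chunksize=100_000)
for chunk in reader:
    result = expensive_process(chunk)
    # ... use result
```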
But now I have 2 such files. I would like to create 2 such generators and 'emplace' one on one core and one on another, such that I can run:
result = expensive_process(next(iterator))
on each core, in parallel, and then combine and return the results. This repeats until one or both of the generators is exhausted.
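Put differently, below is the serial version of what I'm after (combine is a placeholder for however the two per-step results get merged); the goal is for the two expensive_process calls inside the loop to run on separate cores at the same time:

```python
import pandas as pd  # expensive_process as defined above

def combine(r1, r2):
    # placeholder for however the two per-step results get merged
    return r1 + r2

reader1 = pd.read_csv("file1.csv", iterator=True, chunksize=100_000)
reader2 = pd.read_csv("file2.csv", iterator=True, chunksize=100_000)

# zip stops as soon as either reader runs out of chunks
for chunk1, chunk2 in zip(reader1, reader2):
    r1 = expensive_process(chunk1)   # would like this on one core...
    r2 = expensive_process(chunk2)   # ...and this on another, at the same time
    combined = combine(r1, r2)
```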
It looks like the TextFileReader is not pickleable, and neither is a generator. I can't figure out how to do this with dask or multiprocessing. Is there a pattern for this?
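For reference, the naive multiprocessing version below is roughly where this falls over (process_next is just an illustrative helper, and reader1/reader2 are the chunked readers from above):

```python
from multiprocessing import Pool

def process_next(reader):
    return expensive_process(next(reader))

# Pool has to pickle each reader to ship it to a worker process, and
# TextFileReader (like a generator) can't be pickled, so this raises.
# Even if it could be pickled, the worker would advance its own copy
# of the iterator, not the one in the parent process.
with Pool(2) as pool:
    step_results = pool.map(process_next, [reader1, reader2])
```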