I have a generator object that loads quite a large amount of data and hogs the system's I/O. The data is too big to fit into memory all at once, hence the generator. I also have a consumer that uses all of the CPU to process the data yielded by the generator; it does not consume much of any other resource. Is it possible to interleave these tasks using threads?
For example, I'd guess it should be possible to run the simplified code below in 11 seconds: the first item is ready after 1 second, and all remaining generation overlaps with consumption.
import time, threading

lock = threading.Lock()

def gen():
    for x in range(10):
        time.sleep(1)  # simulated I/O
        yield x

def con(x):
    with lock:         # the lock models CPU work that cannot run in parallel
        time.sleep(1)
    return x + 1
However, the most straightforward application of threads does not run in that time. It does speed up, but I assume that is due to parallelism between the dispatcher (which runs the generator) and a worker, not to parallelism between the workers themselves.
import joblib
%time joblib.Parallel(n_jobs=2,backend='threading',pre_dispatch=2)((joblib.delayed(con)(x) for x in gen()))
# CPU times: user 0 ns, sys: 0 ns, total: 0 ns
# Wall time: 16 s
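For concreteness, the kind of overlap I have in mind could be hand-rolled with only the standard library: run the generator in a background thread that feeds a bounded queue, and consume from the queue in the main thread. This is just a sketch of the idea, not necessarily the right tool (the prefetch helper and SENTINEL name are mine, and the sleeps are shortened so it finishes quickly):

```python
import queue
import threading
import time

SENTINEL = object()  # marks the end of the stream

def prefetch(it, maxsize=1):
    # Run the iterator in a background thread; a bounded queue keeps at
    # most `maxsize` items buffered, so memory use stays small.
    q = queue.Queue(maxsize=maxsize)

    def producer():
        for item in it:
            q.put(item)      # blocks while the queue is full
        q.put(SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is SENTINEL:
            return
        yield item

def gen(n=10, delay=0.05):
    for x in range(n):
        time.sleep(delay)    # simulated I/O
        yield x

def con(x, delay=0.05):
    time.sleep(delay)        # simulated CPU work; one consumer, so no lock needed
    return x + 1

start = time.perf_counter()
results = [con(x) for x in prefetch(gen())]
elapsed = time.perf_counter() - start
# Run sequentially this would take about 2 * 10 * 0.05 = 1.0 s; with the
# generator prefetching in the background it takes roughly (10 + 1) * 0.05 s.
```

Scaled back up to 1-second sleeps, this is exactly the 11-second schedule described above, so I assume something equivalent should be expressible through joblib as well.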