I've searched probably 10 threads on multiprocessing, but nothing seems to fit my use case. Here is a general idea of what I want to parallelize:
    class foo:
        def boo(self):
            filename = 'path to the data file'
            # fileReader is a pykaldi SequentialMatrixReader yielding (id, feature) pairs
            with reader(filename) as fileReader:
                for id, feature in fileReader:
                    self.boo2(id, feature)

        def boo2(self, id, feature):
            # process feature, then save the output to a folder based on id
            ...
I want to parallelize the calls to boo2(). Here fileReader is an iterator (a SequentialMatrixReader from pykaldi) yielding tens of thousands of (id, feature) rows, where id is a string and each feature is a matrix (hundreds of rows x tens of columns). boo2 computes a smaller matrix and saves the result to a folder based on id. Each call to boo2 is independent of the others, so I want to parallelize it.
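For a rough idea of the shape of the work, boo2 does something along these lines (illustration only: the real computation is more involved, and output_dir, the .npy format, and the numpy conversion are placeholders, not my actual code):

    import os
    import numpy as np

    def boo2(self, id, feature):
        # illustrative placeholder: reduce the (hundreds x tens) feature matrix
        # to something smaller, assuming it can be viewed as a numpy array
        smaller = np.asarray(feature).mean(axis=0)
        # save the result to a folder, with the file name derived from id
        np.save(os.path.join(self.output_dir, id + '.npy'), smaller)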
From my understanding I can't use multiprocessing.Pool, since boo2 is a class method and I can't pull it out of the class due to its complexity.

I also don't know how to use multiprocessing.Process directly, since the number of cores is much smaller than the number of rows in the iterator and I'm not sure how to queue new calls to boo2 once I've start()-ed and join()-ed a set of processes. I have tried splitting fileReader into n batches and giving each batch its own Process (roughly as sketched below), but I'd much prefer to queue the calls in a single loop rather than managing multiple batches.
I've also looked into the pathos module, since it doesn't have problems with class methods. However, from the sample use cases the closest fit to my need is:

    pool = pathos.threading.ThreadPool()
    pool.imap(boo2, [feature for feature in fileReader])

But because of how large fileReader is, I am unable to fit [feature for feature in fileReader] in memory.
Any and all help is appreciated. Thank you.