The task is to parallelize information retrieval for clusters. get_one_mc_retrievals retrieves information for a given mention candidate (mc); under the hood it invokes pairwise_distances from sklearn, which supports multiple CPUs. Even so, the resources were still heavily under-used, so I implemented the snippet of code below.
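For reference, here is a minimal sketch of the multi-CPU path I am relying on; X and Y are placeholder arrays, not the actual data that get_one_mc_retrievals works with:

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(100, 64)   # placeholder query vectors
Y = np.random.rand(5000, 64)  # placeholder corpus vectors

# n_jobs=-1 spreads the distance computation across all CPUs
D = pairwise_distances(X, Y, metric="cosine", n_jobs=-1)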
My implementation:
import multiprocessing as mp
from functools import partial

def dev_get_sentence_retrieval(sent_mcs, num_retrieval):
    # bind the retrieval count so the pool only has to map over the mcs
    func = partial(MentionToRetrieval.get_one_mc_retrievals, num_retrieval=num_retrieval)
    pool = mp.Pool(mp.cpu_count())
    pool.daemon = True
    # dispatch one async map per word position in the sentence
    jobs = [pool.map_async(func, sent_mcs[i]) for i in range(len(sent_mcs))]
    res = [j.get() for j in jobs]
    pool.close()
    pool.join()
    return res
Problem
# test 1
for i in range(5):
    res = dev_get_sentence_retrieval(conll_mcs[i], 20)
    print(res)

# test 2
for i in range(5, 10):
    res = dev_get_sentence_retrieval(conll_mcs[i], 20)
    print(res)
Test 1 runs smoothly, but test 2 hangs forever. A couple of SO posts point to pool.join() as the solution, since it reaps zombie processes, but that does not seem to work in my case. Did I miss anything in the code? Please also feel free to suggest better ways to process conll_mcs in parallel.
Note:
# structure of `conll_mcs`
conll_mcs
|--- sentence_mcs
|--- word_position_mcs
|--- mc
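For concreteness, a hypothetical instance of that nesting (the mc objects are shown as strings here):

conll_mcs = [                  # one entry per sentence
    [                          # sentence_mcs: one entry per word position
        ["mc_0", "mc_1"],      # word_position_mcs: list of mention candidates
        ["mc_2"],
    ],
    # ... more sentences
]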