I know this question has been answered several times in different places, but I'm trying to do the work in parallel. I came across this answer to Python: how to determine if a list of words exist in a string, answered by @Aaron Hall. It works perfectly, but the problem is that when I run the same snippet in parallel using ProcessPoolExecutor or ThreadPoolExecutor it is very slow. Normal execution takes 0.22 seconds to process 119,288 lines, but with ProcessPoolExecutor it takes 93 seconds. I don't understand the problem; the code snippet is here.
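For reference, the words_in_string from that linked answer is (as best I recall) the set-intersection one-liner, which I'm using unchanged:

```python
def words_in_string(word_list, a_string):
    # return the subset of word_list whose words appear in a_string
    return set(word_list).intersection(a_string.split())
```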
from concurrent.futures import ProcessPoolExecutor

def multi_thread_execute():  # this takes 93 seconds
    lines = get_lines()
    print("got {} lines".format(len(lines)))
    futures = []
    my_word_list = ['banking', 'members', 'based', 'hardness']
    with ProcessPoolExecutor(max_workers=10) as pe:
        for line in lines:
            ff = pe.submit(words_in_string, my_word_list, line)
            futures.append(ff)
    results = [f.result() for f in futures]
The single-threaded version takes 0.22 seconds:
my_word_list = ['banking', 'members', 'based', 'hardness']
lines = get_lines()
for line in lines:
    result = words_in_string(my_word_list, line)
I have a single 50 GB+ file (Google 5-gram files). Reading the lines in parallel works very well, but the multiprocessing code above is far too slow. Is this a problem with the GIL? How can I improve performance?
Sample format of the file (a single file of 50+ GB; total data is 3 TB):
n.p. : The Author , 2005 1 1
n.p. : The Author , 2006 7 2
n.p. : The Author , 2007 1 1
n.p. : The Author , 2008 2 2
NP if and only if 1977 1 1
NP if and only if 1980 1 1
NP if and only if 1982 3 2
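One thing I suspect is the per-task overhead of submitting each line separately. Would batching lines into chunks, so each task does real work, help? A sketch of what I mean (words_in_string repeated from the linked answer so this runs standalone; chunk size is a guess):

```python
from concurrent.futures import ProcessPoolExecutor

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

def words_in_chunk(word_list, chunk):
    # process a whole chunk of lines in a single task
    return [words_in_string(word_list, line) for line in chunk]

def multi_process_chunked(lines, my_word_list, workers=10, chunk_size=10000):
    # split lines into chunks to amortise pickling/IPC cost per submit
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    results = []
    with ProcessPoolExecutor(max_workers=workers) as pe:
        for chunk_result in pe.map(words_in_chunk,
                                   [my_word_list] * len(chunks), chunks):
            results.extend(chunk_result)
    return results

if __name__ == "__main__":
    sample = ["banking is based on trust", "no match here"]
    print(multi_process_chunked(sample,
                                ['banking', 'members', 'based', 'hardness'],
                                workers=2, chunk_size=1))
```

Is this the right direction, or is the overhead coming from somewhere else?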