
I know this question has been answered several times in different places, but I'm trying to do things in parallel. I came across this answer from Python: how to determine if a list of words exist in a string, answered by @Aaron Hall. It works perfectly, but the problem is that when I run the same snippet in parallel using ProcessPoolExecutor or ThreadPoolExecutor it is very slow. Normal execution takes 0.22 seconds to process 119,288 lines, but with ProcessPoolExecutor it takes 93 seconds. I don't understand the problem; the code snippet is here.

from concurrent.futures import ProcessPoolExecutor

def multi_thread_execute():  # this takes 93 seconds
    lines = get_lines()
    print("got {} lines".format(len(lines)))
    futures = []
    my_word_list = ['banking', 'members', 'based', 'hardness']
    with ProcessPoolExecutor(max_workers=10) as pe:
        for line in lines:
            ff = pe.submit(words_in_string, my_word_list, line)
            futures.append(ff)
    results = [f.result() for f in futures]

The single-threaded version takes 0.22 seconds:

my_word_list = ['banking', 'members', 'based', 'hardness']
lines = get_lines()
for line in lines:
    result = words_in_string(my_word_list, line)

I have a single 50 GB+ file (Google 5-gram files). Reading lines in parallel works very well, but the multiprocessing code above is much too slow. Is it a problem with the GIL? How can I improve performance?

Sample format of the file (single file with 50+ GB; total data is 3 TB):

n.p. : The Author , 2005    1   1
n.p. : The Author , 2006    7   2
n.p. : The Author , 2007    1   1
n.p. : The Author , 2008    2   2
NP if and only if   1977    1   1
NP if and only if   1980    1   1
NP if and only if   1982    3   2
bommina
  • Sometimes a single thread is faster than multiprocessing. The reason it could be slow is the overhead needed in multiprocessing. It is true you have more cores and more threads, but it takes time to split your data evenly and join all the processes together to keep them in sync. You mentioned that you have a 50 GB+ single file where the parallel version works well; in that instance the overhead of parallelism is beneficial to the overall performance. – gnahum Jul 08 '20 at 17:13
  • Does this answer your question? [Does Python support multithreading? Can it speed up execution time?](https://stackoverflow.com/questions/20939299/does-python-support-multithreading-can-it-speed-up-execution-time) – griffin_cosgrove Jul 08 '20 at 17:22
  • But here I'm not testing on 50 GB of data; it is a test of 119,288 lines, and in parallel it is 120% slower compared to a single thread. I'm new to Python, so I'm not sure how this code snippet works: "return set(word_list).intersection(a_string.split())". I'm assuming there may be a lock on this method, because I'm using parallel file reading and some other stuff that is 10x faster in parallel, except for this use case. So I'm interested to know what causes the code execution to slow down. – bommina Jul 09 '20 at 03:21
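For what it's worth, the quoted snippet from the linked answer is a plain set intersection and takes no locks; a minimal sketch of what it does (the function name `words_in_string` follows the question's code):

```python
def words_in_string(word_list, a_string):
    # builds a set from the word list and intersects it with the
    # whitespace-split tokens of the line; pure computation, no locks
    return set(word_list).intersection(a_string.split())

print(sorted(words_in_string(['banking', 'members'], "members of the banking club")))
# ['banking', 'members']
```

The slowdown therefore cannot come from locking inside this function; each call is independent.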

1 Answer


Python is known to be a language that generally does not have strong use cases for multithreading; more about why can be read in this StackOverflow question.
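Beyond the GIL, the dominant cost here is submitting one tiny future per line, which pays pickling and inter-process overhead 119,288 times. One way to make the parallel version competitive (a sketch under assumed names, not tested against the asker's data; `find_words` and `chunksize=10_000` are illustrative choices) is to let `executor.map` batch the lines:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def words_in_string(word_list, a_string):
    return set(word_list).intersection(a_string.split())

def find_words(lines, word_list=('banking', 'members', 'based', 'hardness')):
    # chunksize makes each worker receive thousands of lines per IPC round
    # trip instead of one, so pickling overhead no longer dominates the
    # tiny per-line task
    task = partial(words_in_string, word_list)
    with ProcessPoolExecutor() as pe:
        return list(pe.map(task, lines, chunksize=10_000))

if __name__ == '__main__':
    sample = ["members based banking", "nothing relevant here"]
    print(find_words(sample))
```

The chunk size is a tuning knob: too small and overhead returns, too large and the load balances poorly across workers.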

griffin_cosgrove