I'm trying to parallelize a function I wrote for a sequential program. Below are the inputs and the expected output.
Input 1, a list of strings: ["foo bar los angeles", "foo bar new york", ...]
Input 2, a list of strings used as a dictionary: ["los angeles", "new york", ...]
I want to remove every string in input 2 from the strings in input 1, so the output will look like:
["foo bar", "foo bar"].
I'm able to do it using a double for loop:

    res = []
    for s1 in input1:
        # strip every dictionary phrase that occurs in this string
        for s2 in input2:
            if s2 in s1:
                s1 = s1.replace(s2, "")
        res.append(s1.strip())
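For reference, here is a self-contained run of that loop on the sample data above:

    input1 = ["foo bar los angeles", "foo bar new york"]
    input2 = ["los angeles", "new york"]

    res = []
    for s1 in input1:
        for s2 in input2:
            if s2 in s1:
                s1 = s1.replace(s2, "")
        res.append(s1.strip())

    print(res)  # prints: ['foo bar', 'foo bar']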
But this runs slowly (more than 10 minutes on my MacBook Pro) when input1 has about 2 million strings (input2 has a couple of thousand).
I found a way to use Python's multiprocessing.dummy.Pool, and to use pool.map along with a global variable, to parallelize it. But I'm concerned about the use of the global variable. Is it safe to do so? Is there a better way for Python threads to share a variable (maybe something like Apache Spark's mapPartitions)?
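Roughly what I have now (a minimal sketch; the function name clean_one and the pool size of 4 are just placeholders):

    from multiprocessing.dummy import Pool  # thread-based Pool

    input1 = ["foo bar los angeles", "foo bar new york"]
    input2 = ["los angeles", "new york"]   # global, read by every worker thread

    def clean_one(s1):
        # reads the global input2; each call handles one string from input1
        for s2 in input2:
            if s2 in s1:
                s1 = s1.replace(s2, "")
        return s1.strip()

    pool = Pool(4)                         # 4 worker threads
    res = pool.map(clean_one, input1)
    pool.close()
    pool.join()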
I'm using Python 2.7 now, so I'd prefer answers that use Python 2.