
I'm trying to parallelize a function I wrote for a sequential program. Below are the input and output.

Input 1, a list of strings: ["foo bar los angles", "foo bar new york", ...]

Input 2, a list of strings to use as a dictionary: ["los angles", "new york", ...]

I want to remove every string in input 2 from the strings in input 1. So the output will be like:

["foo bar", "foo bar"].

I'm able to do it using a double for loop.

res = []
for s1 in input1:
    for s2 in input2:
        if s2 in s1:
            res.append(s1.replace(s2, ""))

But this runs slowly (more than 10 minutes on my MacBook Pro) when input1 has about 2 million entries (input2 has a couple of thousand).

I found a way to use Python's multiprocessing.dummy.Pool, using pool.map along with a global variable to parallelize it. But I'm concerned about the use of the global variable. Is it safe to do so? Is there a better way for Python threads to share a variable (maybe something like Apache Spark's mapPartitions)?
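
Roughly, the threaded version I have looks like this (a simplified sketch of what I described -- the worker name clean and the pool size are just placeholders):

from multiprocessing.dummy import Pool  # thread pool with the multiprocessing API

input1 = ["foo bar los angles", "foo bar new york"]
input2 = ["los angles", "new york"]  # shared with the worker threads as a global

def clean(s1):
    # every worker thread reads the global input2
    for s2 in input2:
        if s2 in s1:
            s1 = s1.replace(s2, "")
    return s1

pool = Pool(4)
res = pool.map(clean, input1)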

I'm using Python 2.7, so I'd prefer answers that use Python 2.

xuanyue
    As an intermediary step before you go ahead with the parallelization, you could probably make your code a lot faster just by switching the lists to sets (if that's okay for your problem) – Tyler Jun 14 '16 at 22:45
  • Do you need to preserve ordering of the original list? – rrauenza Jun 14 '16 at 22:47
  • @Tyler Well, I don't think using a set will boost performance, because I still need to traverse the whole set. My requirement is to remove a substring from items of input1 if that substring is in input2, so I don't think a set will improve performance. – xuanyue Jun 14 '16 at 23:13
  • @rrauenza No, I don't need to preserve the ordering. – xuanyue Jun 14 '16 at 23:14

1 Answer


It's generally recommended to avoid multithreading in Python when you're after performance, because of the GIL. Luckily we have multiprocessing!

#!/usr/bin/python
import itertools
import multiprocessing

in1 = ["foo bar los angles", "foo bar new york",]
in2 = ["los angles", "new york",]

results = []

def sub(arg):
    # each work item is a single (s1, s2) pair from the Cartesian product
    s1, s2 = arg
    if s2 in s1:
        return s1.replace(s2, "")
    # falls through and returns None when s2 does not occur in s1

pool = multiprocessing.Pool(4)
for result in pool.imap(sub, itertools.product(in1, in2)):
    if result is not None:
        results.append(result)

print results

It sounds like your 2 million item list is already in memory, so you'll want to use imap rather than map to avoid materializing the product as a list of billions of items. I also use itertools.product to generate the Cartesian product of your inputs -- which is what your nested loop was doing.
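
To see why, note that itertools.product returns a lazy iterator, so the pairs are generated one at a time rather than built up front. A quick sketch with sizes like yours:

import itertools

pairs = itertools.product(xrange(2000000), xrange(2000))
print next(pairs)  # (0, 0) -- pairs come out one at a time, on demand
# list(pairs) would materialize roughly 4 billion tuples, which is
# what pool.map would force and pool.imap avoids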

Your requirements were a little vague in terms of uniqueness -- you were only adding to the results if you found a match.

Since we only append to results in the main process, there is no need to worry about the global results variable. If you were using multithreading, your map function could write directly to results because of the GIL's protection ... but your concurrency would also be suffering under that same GIL.

Note that you can tune imap by passing a larger chunksize. Since you don't need to preserve ordering, you can optimize further by using imap_unordered. See the multiprocessing documentation for more information.
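
For example, a tuned version of the same script might look like this (a sketch -- chunksize=1000 is an arbitrary starting point to experiment with, not a recommendation):

#!/usr/bin/python
import itertools
import multiprocessing

in1 = ["foo bar los angles", "foo bar new york",]
in2 = ["los angles", "new york",]

def sub(arg):
    s1, s2 = arg
    if s2 in s1:
        return s1.replace(s2, "")

pool = multiprocessing.Pool(4)
results = []
# bigger chunks cut down on IPC overhead; unordered yields results as workers finish
for result in pool.imap_unordered(sub, itertools.product(in1, in2), chunksize=1000):
    if result is not None:
        results.append(result)

print results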

rrauenza
  • Thank you rrauenza. But just a follow-up question: what if I want to add another input to the function sub? Assume input1 is a list of lists, [["foo bar los angles"], ["foo bar new york"]], and I need to pass in an int variable 0 to access the inner list and get the string. Is there a way to share it without using a global variable? – xuanyue Jun 14 '16 at 23:27
  • I don't understand this question. Can you revise your posted question to be more accurate regarding what you want to do? – rrauenza Jun 15 '16 at 00:03
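
For what it's worth, one global-free way to pass that extra int is to pair it with each work item via itertools.izip and itertools.repeat -- a sketch under the assumptions in the follow-up comment above (the names sub and idx are placeholders, and the actual substring removal is elided):

import itertools
import multiprocessing

input1 = [["foo bar los angles"], ["foo bar new york"]]
idx = 0  # the extra int from the follow-up question

def sub(arg):
    row, i = arg
    s1 = row[i]  # use the extra int to pull the string out of the inner list
    return s1    # the replace() logic from the answer would go here

pool = multiprocessing.Pool(4)
# zip every row together with the same idx instead of sharing it as a global
print pool.map(sub, itertools.izip(input1, itertools.repeat(idx)))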