
I have a function that takes a list and returns a list of lists of n-grams (here n = 2). How can I parallelize this function to reduce the running time?

I'm trying this, but it's not working. The data_list is a list of strings.

import multiprocessing
from multiprocessing.dummy import Pool
from collections import OrderedDict

grams_list = []
data_list = ["Hello, I am learning Python",
             "Python is a very Powerful language",
             "And Learning python is easy" ]

def ngrams(input, n):
    input = input.split(' ')
    output = []    
    for i in range(len(input) - n + 1):
        output.append(input[i:i + n])
    return output

def generating_grams_list(data_list):
    for j in range(0, len(data_list)):
        grams = [' '.join(x) for x in ngrams(data_list[j], 2)]  # Creating ngrams
        grams_list.append(list(OrderedDict.fromkeys(grams)))  # removing duplicates
        # print "Creating ngrams list for each data string ", j
    return grams_list


if __name__ == '__main__':
    pool = Pool(multiprocessing.cpu_count())
    results = pool.map(generating_grams_list, data_list)
    pool.close()
    pool.join()

    for result in results:
        print("result", result)
sahil
  • What doesn't work exactly? Is there an error? Are the results not as expected? Using the `dummy` module you won't get any parallelization. Concurrency is not the same as parallelism – karlson Apr 06 '17 at 06:55
  • @karlson Result is not as expected – sahil Apr 06 '17 at 06:58
  • Why not extend your question with what the result is, and what you would have expected? – karlson Apr 06 '17 at 06:59
  • Why are you using `multiprocessing.dummy`? See [this question](http://stackoverflow.com/questions/26432411/multiprocessing-dummy-in-python-is-not-utilising-100-cpu) – Peter Wood Apr 06 '17 at 07:05

1 Answer


First of all, with the `multiprocessing.dummy` module you won't reduce the duration of your program, because it distributes work across threads, not processes. That means the computation will still run on only one processor. For the difference between concurrency and parallelism see, for instance, this question and answer.

To get real parallelization you need to spread the work across multiple processes, e.g. using `multiprocessing.Pool` instead.
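A minimal sketch of that approach, with your code restructured so that each worker processes a single string (the helper name `grams_for_string` is my own, not from your code):

```python
import multiprocessing
from collections import OrderedDict

def ngrams(text, n):
    """Return the list of n-grams (as word lists) of a string."""
    words = text.split(' ')
    return [words[i:i + n] for i in range(len(words) - n + 1)]

def grams_for_string(text):
    """Build the de-duplicated bigram list for one string."""
    grams = [' '.join(x) for x in ngrams(text, 2)]
    return list(OrderedDict.fromkeys(grams))  # removes duplicates, keeps order

if __name__ == '__main__':
    data_list = ["Hello, I am learning Python",
                 "Python is a very Powerful language",
                 "And Learning python is easy"]
    # A pool of real worker processes; map() hands one string to each call
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        results = pool.map(grams_for_string, data_list)
    for result in results:
        print("result", result)
```

Note that each worker returns its result instead of appending to a module-level `grams_list`: processes don't share memory, so mutating a global in a worker would have no effect in the parent.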

To address your actual problem (if I'm guessing right, since you didn't say exactly what it is):

You probably want `data_list` to be a list of lists of strings instead of a list of strings. With the code as it stands (i.e. if `data_list` and `grams_list` were actually defined), you'd be sending a single string to each invocation of `generating_grams_list`, which is most likely not what you want, as the `for` loop wouldn't make any sense (you'd be looping over characters).
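You can see the pitfall in plain Python: indexing into a string yields single characters, so a function written to loop over a list of strings degenerates when handed one string.

```python
text = "Hello"
# This is what the inner loop effectively does when pool.map
# passes a single string instead of a list of strings:
pieces = [text[j] for j in range(len(text))]
print(pieces)  # ['H', 'e', 'l', 'l', 'o'] -- characters, not strings
```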

As a side note: the pattern `for j in range(len(x)): func(x[j])` is better written as `for j, item in enumerate(x): func(item)`.
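Applied to your loop, the two forms are equivalent, but `enumerate` avoids the manual indexing:

```python
data_list = ["Hello, I am learning Python",
             "Python is a very Powerful language"]

# Index-based loop (what the question uses):
for j in range(len(data_list)):
    print(j, data_list[j])

# Idiomatic equivalent:
for j, item in enumerate(data_list):
    print(j, item)
```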

karlson