
I am new to parallelizing and optimizing data-mining modules in Python, and I have a question about parallelizing the population of a dictionary. I am building an inverted index from values stored in a two-dimensional matrix `m`. The code works fine, but I would like to apply Python's `reduce` to make it run faster.

Here is my code:

def createInvertedIndex(matrix):
    # matrix[0] holds document IDs and matrix[1] holds tokens;
    # column i pairs document matrix[0][i] with token matrix[1][i].
    dic = {}
    for i in range(len(matrix[0])):
        if matrix[1][i] in dic:  # membership test directly on the dict
            dic[matrix[1][i]].append(matrix[0][i])
        else:
            dic[matrix[1][i]] = [matrix[0][i]]
    return dic
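
For example, with made-up document IDs and tokens, the input and output look like this:

# Hypothetical sample data: row 0 is document IDs, row 1 is tokens.
m = [["doc1", "doc2", "doc3", "doc4"],
     ["apple", "banana", "apple", "banana"]]

print(createInvertedIndex(m))
# {'apple': ['doc1', 'doc3'], 'banana': ['doc2', 'doc4']}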
  • 4
    Why would functools.reduce improve performance? – Kelly Bundy Mar 13 '20 at 18:11
  • Because I want to use it for multiprocessing. The original code is not taking advantage of multiprocessing. – user3089485 Mar 13 '20 at 18:18
  • 4
    What does it have to do with multiprocessing, though? I mean, sure, you can use it in multiprocessing, but you can also use a loop in multiprocessing, and probably the loop is going to be faster and more appropriate. – Kelly Bundy Mar 13 '20 at 18:20
  • Do you have a solution for running the above code faster in a multi-core environment? Can you elaborate your solution? – user3089485 Mar 13 '20 at 18:26
  • 6
    `functools.reduce` has nothing to do, implementation-wise, with MapReduce. It's not innately parallel, it's just a loop with incremental application of a function with the accumulated value and each new value from the input iterable. – ShadowRanger Mar 13 '20 at 18:31
  • @HeapOverflow I am trying to use the approach from https://lerner.co.il/2014/05/11/creating-python-dictionaries-reduce/ – user3089485 Mar 13 '20 at 18:36
  • Right, I was hoping to be able to use it with map_partition – user3089485 Mar 13 '20 at 18:37
  • 1
    I don't expect this code to be faster in parallel in python. With the `threading` module it will not be faster due to the GIL. With the `multiprocessing` module it will also probably not be faster due to the data transfer needed between processes. Why not using the `numpy` module here? It could give you a good speed-up compare to this dict-base pure-python algorithm. – Jérôme Richard Mar 13 '20 at 20:07
  • So the purpose here is to populate an inverted-index dictionary `dic` that maps each token to the list of documents containing it. Not sure how I can use `numpy` here. – user3089485 Mar 13 '20 at 20:46
  • 1
    If you want to know how to use `functools.reduce()` you should read the docs, and maybe a few tutorials. As others have already said, this has nothing to do with parallelism. I’m voting to close this question. Please see [tour], [ask], [help/on-topic]. – AMC Mar 14 '20 at 03:22
  • @user3089485 What you are trying to do looks like [this](https://stackoverflow.com/questions/38013778/is-there-any-numpy-group-by-function) where the groupby keys are in `matrix[1]` and the values are in `matrix[0]`. – Jérôme Richard Mar 14 '20 at 17:46
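
To make ShadowRanger's point in the comments concrete, here is a minimal sketch of the reduce-based construction in the style of the linked lerner.co.il post, adapted to this matrix layout (the helper name `add_pair` is invented for illustration). Note that `functools.reduce` still folds in one element at a time, sequentially:

from functools import reduce

def add_pair(dic, pair):
    # Fold a single (doc_id, token) pair into the accumulator dict.
    doc_id, token = pair
    dic.setdefault(token, []).append(doc_id)
    return dic

def createInvertedIndexReduce(matrix):
    # Equivalent to the loop in the question, expressed as a sequential
    # fold; there is no parallelism here.
    return reduce(add_pair, zip(matrix[0], matrix[1]), {})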
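For the multi-core question, the usual pattern is to split the columns into chunks, build a partial index per process, and merge the partial dicts at the end. A sketch under those assumptions (the worker count is arbitrary, and as the commenters warn, pickling the chunks and partial results often outweighs any savings):

from multiprocessing import Pool

def createInvertedIndexParallel(matrix, workers=4):
    n = len(matrix[0])
    step = max(1, (n + workers - 1) // workers)
    # Map step: one (docs, tokens) column slice per worker, reusing
    # createInvertedIndex from the question on each slice.
    chunks = [[matrix[0][i:i + step], matrix[1][i:i + step]]
              for i in range(0, n, step)]
    with Pool(workers) as pool:
        partials = pool.map(createInvertedIndex, chunks)
    # Reduce step: merge the partial indexes sequentially.
    merged = {}
    for part in partials:
        for token, doc_ids in part.items():
            merged.setdefault(token, []).extend(doc_ids)
    return merged

On platforms that spawn rather than fork (e.g. Windows), the call would need to sit under an `if __name__ == "__main__":` guard.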
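And to illustrate the numpy group-by recipe Jérôme Richard links above: sort both columns by token, then split the doc-ID column at each token boundary. A sketch assuming the columns can be converted to homogeneous numpy arrays:

import numpy as np

def createInvertedIndexNumpy(matrix):
    docs = np.asarray(matrix[0])
    tokens = np.asarray(matrix[1])
    # Stable sort by token keeps doc IDs in original order within a token,
    # matching the loop version.
    order = np.argsort(tokens, kind="stable")
    docs, tokens = docs[order], tokens[order]
    # First occurrence of each unique token marks a group boundary.
    keys, starts = np.unique(tokens, return_index=True)
    groups = np.split(docs, starts[1:])
    return {k: g.tolist() for k, g in zip(keys.tolist(), groups)}

This pushes the grouping work into vectorized sorting instead of a Python-level loop, which is where the speed-up mentioned in the comments would come from.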

0 Answers